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SYSTEM FOR DATA MANAGEMENT 



10 FIELD OF THE INVENTION 

The present invention provides a means by which one can integrate data from a variety of 
databases each with its own content, organization and structure into a single repository. The user 
can then retrieve and display the integrated data in its original form. 

BACKGROUND OF THE INVENTION 

15 Others have tried to cross-index data across the original source data, but only achieved an ability 
to index the data. Retrieval was then left to the original source application. As well, this approach 
did little to normalize the data, and achieved only limited Integration. Conventional wisdom 
frowned upon integrating the data into a single repository as too labor-intensive and difficult. As 
well, until recently, few technologies were available to manage datatiases of this size with the 
' 20 necessary speed and flexibility required by a general query program. 

BACKGROUND OF THE ART 



Several patents exist which may be relevant to the present invention. Such U.S. Patent Nos. are: 



25 4205371 

4908759 
5426780 
5450581 
5515534 

30 5566332 

5596746 
5600826 
5721912 
5740421 

35 5764973 

5778375 
5787433 
5940832 
5999937 

40 6014670 

6016501 
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DESCRIPTION OF THE PREFERRED EMBODIMENTS 

The present Invention assists in classifying, organizing and integrating raw infomiation of all types 
into a form that permits the retrieval and analysis of the infomiation by means of a simple user 
interface. Once integrated Into the database and classified into the poly-hierarchical tree 
5 structure, it is possible to request Information using relationships that did not previously exist In 
the source data. 

Once the data has been integrated into a repository specifically designed to be extensible 
(designed in a manner similar to a meta-data database) the problem of integrating disparate 
databases becomes considerably easier. Retrieval and classification of the data then becomes 

10 the next problem to be solved. To do this, a poly-hierarchlcal tree-structure was implemented 
l^l^Sl^^P^^^'^^®^ ^ number of Important and beneficial side effects. The poly-hierarchical tree 
( PHT ) IS a data structure with branches and leaves. The branches describe the hierarchy (topics 
and sub-topics), and the leaves describe the information nodes. The 'pol/ portion of the name 
refers to the fact that a node (either a branch or a leaf) can appear more than once in the 

15 hierarchy. So to use an example, the Toxicology' branch appears below, the HEALTH branch as 
well as the CHEMICAL PROPERTIES branch, and in both locations, it leads to a sub-hierarch^ of 
branches and leaves that are below It. The poly-hierarchlcal tree provides an inherent conte)rt to 
each data node that can be used to enhance and optimize text searches and other relationships 
to the data node. 

20 The present invention can operated on various hardware configurations. A preferred component 
list IS as follows. One Unix Server services the Oracle Database Engine with two database 
instances: a transactional database and a data repository database: a second server (Unix or NT) 
services the WebLogIc Web Application Server. In a fault-tolerant production environment each 
Server is duplicated. The two WebLogic Servers are operated in load-balanclng/fail-over mode 

25 and are connected to one of the two Database Servers containing the Primary Transactlori 
Database. The Secondary Transaction database Is operated in a hot standby mode continuallv 
updated with the archive logs from the Primary Transaction Database. Both of the Data 
Repository Databases are connected In load-balancing/fail-over mode to the Primary Transaction 
Database, which passes along the data requests generated by the Web Application Server. 

30 One can search across granulated data as related to the text/images. Text can be processed 
(parsed) to extract the infomiation that allows It to be Integrated Into the database i e the 
substance name, manufacturer. CAS number. Physical State, or the like. Images, on 'the "other 
hand are preferably accompanied by an index containing the necessary infomiation In order to be 
able to integrate them Into the database. 

35 The context features are used to refine a full-text search. For instance, if a user has descended to 
Toxicology from the 'Health' node, there may be some peer nodes in its immediate vicinity 
relating to woricplace standards, exposure limits, First Aid Indications, and the Wke These 
neighbors provide a context with which to weight the results of a full-text search They might also 
define a specific collection of Information within which to search for results, excluding a more 

40 general search that would retum unrelated documents or data. If the user had reached 
Toxicology' from the 'Chemical Properties* node, the neighbouring branches and leaves mlaht 
weight the search towards other property information, such as. but not limited to. Reactivity 
Radioactivity, or the like, which would result in a different set of results. 

Ai. f "f^'^J.^tP®?* P^^®®"* invention is the use of a PHT type of a structure in conjunction with 
45 hJll-text indexing and retrieval techniques. It is believed that such use within an RDBMS context to 
dassify and sort the meta-data (Leaf descriptions, or what would be referred to as columns In a 
traditional 2-dimenslonal data table) in the metabase has heretofore not been disclosed. 

As wall, the concept of designing the metabase to be able to receive and integrate data from 
disparate sources with undetermined formats and structures is also innovative. The integration 

' 002 
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provides a possibility which was not present in the disparate databases taken alone: the ability to 
perform combinatorial searches across all of the disparate data taken together. Furthermore, 
these searches are not related to a single technology (i.e., either e relational (SQL) search or a 
full-text search, or a bit-mapped query) alone - they can now be combined and manipulated 
5 together. 

The present invention also provides a process for integrating data from a variety of databases 
each with its own content, organization and structure into a single repository, whereby the user 
can then retrieve and display the integrated data in its original form, comprising: 

a. transferring data from a source medium into a staging area; 



10 b. converting sakj data into a database format; 

c. mapping said data into a master Index Tree; 

d. normalizing or de-nomnalizing at least a portion of said data Into a format suitable 
for parsing and integration into a target database; 

e. parsing said data to extract granular data; 

15 f. exporting said data of step e. into said target database; 

g. formatting said data in a normalized form; 

h. verifying said target database to ensure integrity has been maintained and 
reproduction has t>een accurate; 

1. archiving said target database to a storage media; 

20 j. merging said data into a master database repository; 

k. exporting said data from step j. to a pre-production instance; and, 



I. de-normalizing said data fram step k. so as to optimize access time in an online 
environment. 



Additional disclosure material is attached hereto and incorporated herein. 

25 White the invention has been described in connection with certain preferred 

embodiments, it is not intended to limit the scope of the invention to the particular forms 
set forth, but, on the contrary, it is intended to cover such alternatives, modifications, and 
equivalents as may be included withiri the true spirit and scope of the Invention as 
defined by the appended claims. All patents, applications and publications referred to 

30 herein are hereby Incorporated by reference in their entirety. 



003 

3 



BNSCKDCID: <WO 0159613A2_L> 



wo 01/59613 



PCT/IBO 1/00369 



EXPRESS MAIL LABEL NO. EL660499681US 



Raw Source Data 




Staged Database 



ay 




staged Database 
(reformatted) 



The data is moved from the source media 
and imported into a staging area. Where 
appropriate, the data is also converted into 
a format suitable for, and then imported Into 
a Microsoft^ Access database. 



The resulting data, in database fomiat 
where appropriate, is then analyzed and 
mapped into the master Index Tree. Where 
necessary, the data is de-normalized or 
normalized into a format suitable for parsing 
and integration into the target database. 
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Staged Database 
(Parsed) , 




Target Database 
(Containing data 
from a single source 
and version) 



Validation 
(Data Integrity & 
Accuracy against 
Source Data) 



Once tlie data has been massaged into a 
form from which it can be manipulated, it 
undergoes a parsing process to extract 
granular data from any raw text that has 
been specified during the mapping process. 



When the data has been fully prepared and 
verified, it is then exported into a target 
Oracle database. At this point it is formatted 
In a highiy normalized form, which facilitates 
the creation of relationships between data 
elements that were previously independent 
from one another. 



The resulting target database is then 
verified to ensure that the integrity of the 
data has been maintained throughout the 
integration process, and is also verified 
against the original source data to ensure 
accuracy of reproduction. 
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Once the target database is verified arid 
approved, it is archived to a permanent 
media. The data is then merged into the 
master database repository. 



IV^aster Database 
Repository 




The data in the Master Database is 
exported to a Pre-Production instance and 
de-normalized in order to optimize access 
time in an on-line environment. 



Pre-Production 
Database 




The Pre-Production database is used as a 
fine! testing area and serves as a source 
database for replication to the various on- 
fine databases. 



On-line Databases 
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Database Architecture 



DataCHEST will be storing data from a wide variety of providers. Inevitably, each provider wilt have their 
own preferred structure for organizing their data. Our challenge is to design a database architecture that is 
flexible enough to acconnmodate the importing of data from all of the different provider databases. 
However, the architecture must also be capable of indexing the data so that it can easily be queried and 
retrieved by our search engine with fast response times and minimal bandwidth usage. 

Two choices present themselves: to create a tailor-made, proprietary structure within which to store our 
providers' data, or to design a database architecture that can be implemented within a third-party database 
and Internet/Intranet Web environment. 

Some of the commercially available database and Web applications that we are able to consider include ; 

- Microsoft SQL Server and Internet Information Server 
Oracle 

Informix 

- Sybase and Power Tools 
Microsoft Access 
Microsoft FoxPro 

- Dbase 

- IBM's DB2 

- Object-Oriented Databases (O2. Objectivity, etc.) 

- Atrion International's Chemmate® proprietary database engine 

- Showbase Extra and Showbase Collaborative (httD://vww.showbase.com) 

- Sand Technology's Nucleus Data Warehouse Server (httD://www.sandtechnoloqv.com) 

A common thread runs through nearly all of the above databases. They are, for the most part. ODBC- 
compliant, and are able to use an SQL-like language to query and manipulate their data. Unfortunately, 
this ability works against the very flexibility that we are looking for in order to organize and store our data. 

Among the exceptions in the above list is the proprietary database engine developed by Atrion for Its 
Chemmate® software. Although the development of Chemmate® began in the eariy 1990's, when ODBC 
was not yet universally accepted, nor fully defined, a variety of commercial database applications existed at 
the time that could have been utilized. Atrion chose instead to develop their own engine, primarily because 
the nature of their data made it awkward to store in a traditional 2-dimensional relational database table. 

The structure of the data required for Chemmate* or more precisely the lack thereof, provided the impetus 
for the Atrion designers to embartc on creating their own proprietary solution. Eventually, market pressure 
forced the company to develop an ODBC interface to enable the application to store Us data within an 
Enterprise Database environment Many parallels exist between that situation, and the one being faced by 
DataCHEST today. A close examination of their decision, especially in light of the tools available today, 
would be vefv instructive in our current situation. 

Other solutions noted towards the end of the above list also employ a non-standard relational structure. 
The ShowBase Extra and Collaborative applications are able to import from ODBC-compliant data sources, 
but rely on their proprietary architecture for improved query and access speed. As these products are 
designed primarily for publishing data on the Internet from off-line data sources, this strategy is an 
acceptable one for our application. 

The information contained in this document is proprietary to DataCHEST.com Inc. conridential and intended only for internal use or for 
communication with DataCHEST suppliers and clients. ^H^i 
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The Sand Technology Nucleus application is a revolutionary new ODBC-compliant database engine with a 
unique, proprietary underlying structure, which was designed to optimize the storage of typical data in data 
warehousing environments. In a nutshell, It converts all values to binary tokens (somewhat akin to 
Chemmate® ELID's), creating tables which map tokens to their original values, and then creates a 
compressed database consisting only of tokens. The resulting compressed database Is again transformed 
into binary matrices of value sets, which are again compressed. The binary nature of the resulting 
structure is much more efficient for searching, and consequently. Nucleus automatically indexes all fields In 
the table, nearly always without a speed penalty. 

The present situation 

The following statements characterize some of the different aspects of the data that will be stored in our 
database: 

- There will be many (several thousand) different fields (table columns) relating to each 
substance. (We intend to combine the data from ail of our providers into a unified format, 
accessible from a single user-interface.) 

- There may often be more than one source and corresponding value for a data element. (The 
contents of a cell can be multi-valued.) 

- Not all field values will be present for every record. (The table will be sparsely populated.) 

- The format and size of the data contained in a table cell is indeterminate. It could be 
anything from a simple Boolean value (True/False), a numeric, or a text string, to a complex 
array of values, a bit-mapped Image, a text document or a hyperlink. 

- The databases are multi-lingual. 

- The size of the database and the number of data elements will be extremely large. 

- The data itself will be completely static and user access will be limited to read-only. Updates 
will occur off-line on a standby database, which will be toggled to an on-line state when the 

update is complete. 

The data can be separated into two independent categories: 

- Texts of law or policy (which may relate to classes of substances, but rarely to a single 
substance in particular). 

- Substance-related data (Properties, presence on a list, MSDSs and other related 
documents, etc.) 

At the moment, it is believed that the data architecture requirements for storing texts of law and policy are 
well served by existing full-text indexing technology, and will not require special study. 

To speed up the initial product release, it would be acceptable to index substance-related data by only a 
few selected fields. These would include fields like chemical name, CAS number, EINECS number, 
ELINCS number, UN Number, lATA number, chemical formula, molecular structure, supplier name and 
part number, etc. In other words, only by fields that serve to identify a substance rather than those that 
characterize it. 

This will permit users to search for desired substances and to retrieve desired data relating to those 
substances, which encompasses the bulk of the present demands on the data from the user population. 



The information ccntatned in this document is proprietary to DataCHEST.com Inc.. confidential and Intended only for internal use or for 
communication with DataCHEST supplieis and c6ent6. 
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However, as the user population grows, it is believed that new uses and applications will require that all of 
the data fields be indexed. This would permit the application to rapidly handle queries like 'Retrieve all 
products whose flash point is above 30^ C, whose specific gravity is less thar} 1,5 and which are not iisted 
in EU regulations as R27 in concentrations greater than 14%\ 

For that reason, it is preferred to begin with a robust database environment that can provide the tools 
necessary for DataCHEST to develop a basic package quickly, but which can also support the 
development of a more sophisticated application over the medium term. 



A potential Data Model of relationships has been included below. It should be noted that the design Is far 
from final, and that additions and changes will be made as we consult various experts, and as the project 
progresses. 

As is evident from the figure, nearly ail of the tables relate either to the Substance Table, which contains 
Identification fields only (the ones shown are for example only, and more will be added), or to the Field 
Description table. Actual substance-related data is contained in a set of tables distinguished by data type. 
Data elements are uniquely identified by a substance, an Authority, a Language (where applicable), and a 
date of last update of the data element (when the element was updated, not when the source of the data 
was last updated). 

A Field Table fists the various fields that have been defined for substance-related data. Each field has a 
unique data type, which is related to the data table In which the data is stored. For clarity, the Data Model 
is shown with four major data types (Boolean, Numeric, Text, and Unstructured), but additional types are 
also contemplated. Examples of these would t>e numeric ranges, hyperlinks, and other types of data. 

Field Descriptions are organized into multi-level tree-structure of Groups for easier searching and selection. 

An Authority Table lists the various Information Providers that supply the data in the database, together 
with the version and date of last update of their source database. 

A Language Table lists the various supported languages and is keyed to those tables that are able to hold 
multilingual data. 

A Field Mapping Table lists the mapping between Field Descriptions in our Database and the field 
description and source table in the Information Provider's native database. 

A Synonym Table provides an important tool to facilitate searching for substances. Substance names are 
often vague, referring at times to a unique substance and at other times, referring to a class of substances. 
By allowing duplicate Synonyms, the Synonym Table structure will provide for this condition. 

A Composition table provides a means by which it is possible to represent mixtures by listing each 
ingredient (which itself must exist in the Substance Table) together with its concentration. Note that the 
data will simply be imported from the Infonnatlon Provider. No attempt will be made to ensure that the total 
ingredient concentration will add up to 100%. 



The information contained in this document is proprietary to DataCHEST.com Inc., confidential and intended only for internal use fi^ for 
communication with DataCHEST suppliers and clients. WIWl 
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User Interface 

Ideally, the user should have two modes in which to access the data: 
- Interactively, with results displayed in HTML format in his browser window, or 
. Enflelded, with results returned in an ODBC/JDBC-compatible table that can be stored in a local 
database or in a delimited flat file. 

DataCHEST will promote a 'shopping list' approach for the user interface. 

The initial choice will permit the selection between Documents (Texts of Law or Policy, bit-mapped irr les, 
etc.) and Substance-related Data. As indicated earlier, the document retrieval procedure will utilize off- 
the-shelf Full-text Indexing Search Engine and will be dealt with in a separate document 

The next steps for retrieving Substance-related data can be performed in any order, and will resemr the 
Wizard' approach common In the latest 32-bit applications on the market. 

- The user must identify whether his query will be Interactive or Enfielcad. 

- The user must specify t^i product selection criteria (e.g. «All pr ducts whose name 
contains the text -DlMETriYL"», or «AII matches for CAS # "r :-00-r». or simply 
«TRICHLOROBENZENE»). 

. jhe user must specify which Data Elements (fields or properties) t j wishes to retrieve, 
(e g. Boiling point, Melting point, Flash point, Specific Gravity, European Annex 1 status 
and concentration limits. Presence on TSCA list and reportable quantities, etc.) 

- In many cases, the user will have the option of specifying which Provlder(s) should be 
consulted. This can be done at the time of the query, but an option will be provided to 
set up preferences, priorities and methods in advance, (e.g. «Consult ChemTox first, 
then LoLi. then NCEC» or «Take the average of the values provided by ChemTox, LoLI 
and NCEC» or even aReturn the range across all of the values provided, excluding any 
elements that deviate from the mean by more than one standard deviation*) 

- The user will also be able to filter the results according to the date of last update. This 
will pennit restricting a search to exclude data that was already retrieved by a previous 
request, and/or requesting data from a specific period or version that was published by 
the Information Provider, (e.g. «Select,data from ISIS Q4*1995. last updated between 
9/16/1995 and 1 2/31/1 995».) 

- The user will be able 1o specify which Languages to retrieve for textual or language- 
specific data elements, (e.g. aRetrieve Japanese language MSDS documents for 
"Acetic Acid"».) 

- Lastly, the user will have the option of requesting that the application calculate the fees 
to be charged for the transaction, and displaying them for approval prior to returning the 
request. In this case, the result set will be cached at the server for an appropriate period 
of time so that the query will not have to be repeated. 

Once these steps have been completed, the query will be performed and the results returned in the desired 
format. 

The ideal user interface would be a combination of a drill-down menu-driven structure, or Query Wizard, 
that would display the various options available at every step of the process, and a natural-language 
component that would allow a relatively flexible fomnat for expressing a query. 



The information containod in this document i:: proprietary to DatsCHEST.com Inc.. confidential and intended only for internal use or for 
communication with DataCHEST suppliers and clients 
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Particular attention will have to be made to the definition & grouping of the Field Descriptions. The large 
number of potential field names will give rise to confusion. Rather than being related to the data 
architecture and systems design, this issue is actually an operational one. Nev\^ fields will have to be 
defined for each database that is added, and these fields will have to have a common naming convention 
and be organized in common groups if they are to be compared with each other. Given that each 
Infomnation Provider will have their own Field Description, we will have to be very disciplined in assigning 
our own. 

Not to belabor an example, but, if a user is looking for Boiling Points, and the data is available from 5 
different sources, but in varying formats, there may be 3 or four different Field Descriptions to describe the 
information. Consider, for example, the following data set- 




By describing them in the above fashion, the values are grouped together and will be displayed next to 
each other, so that the user can see all of the available values cleariy and unambiguously. 



The infonmatjon contained tn this document is proprietary to OataCHEST.com Inc.. confidential and intended only for internal use or for 



communication with DataCI4£ST suppliers and clients. 
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If the Field Descriptions were less orderly, the data could be spread out across a very large table, and the 
user would not necessarily know that related data exists under a different heading. Again, as an example, 
look at the following: 




In the above example, because the Field Descriptions are not assigned in an ordered fashion, the related 
data is spread out all across the data set and is much more difficult to comprehend, let alone permit the 
user to see if related data is available. 



Queries 

Apart from the two methods of accessing the data discussed above, there are so far two types of queries 
for substance-related data that can be envisaged at this time: 

- Encyclopedic (i.e. ""Provide the toxicological properties of specified substances relating to 
mutagenicity in rats "). 

- Analytical {I.e. "Which furan-based chemicals on the Toxic Reiease Inventory list are vented into 
the atmosphere in the Southeastern United States in quantities greater than 10 tons/year ") 

Document data (non substance-related) can be queried using a classical full-text search and retrieval 
method. {Find all documents relating to maritime transportation of petrochemicals within the jurisdiction of 
navigable Canadian waters.) 

Certain types of documents, for example MSDSs, merit being queried by both methods. (Select all MSDSs 
for substances that are not on the Canadian Domestic Substances List that contain the phrase 'flush eyes 
with water' in the First-Aid section.) 

A mock-up of a proposed screen layout for the DataCHEST Query Wizard can be found below. 

The Query Wizard will be eventually Implemented as a Java Applet that will run within a compatible 

browser, such as Netscape or MsIE. The Wizard consists of a main window with 3 principal sub-windows. 

The main window contains various menu bars and toolbars, consisting of buttons and controls to assist in 
navigating within the sub-windows. 

The information contained \n this document is proprietary lo DataCHEST.com Inc.. confid«nUal and intended only for interna) use or for 
oonimunicdtion with DataCHEST suppliers and cGenU. Q 



<WO 01S0613A2 I > 



wo 01/59613 



PCT/IBOl/00369 



The first and second sub-windows, respectively entitled Shopping Mall and Data Store, are intended to 
function much like MsWindows Explorer, mainly so that most users will already be familiar with the 
principles of Its operation. The function of these two windows will be to assist the user in building the 
desired query segments. 

The third sub-window, entitled Shopping Cart, will hold the query segments as each one is completed and 
will enable the user to process the final query. 

As with MsWindows Explorer, the Shopping Mall sut>-window is organized as a tree structure containing 
the field groups and descriptions contained in the DataCHEST database. Unlike Explorer however, the 
different levels of the tree are variable, and can be selected by the tab control at the bottom of the 
sub-window. The user can thus choose the order in which to drill down through the levels of the tree. The 
buttons will also pre-select an ordering of the tree levels. The search control on the toolbar shown at the 
top of the main window will retrieve all field descriptions and groups whose contents match that of the 
search window, according to the currently selected ordering of the tree. 

The Data Store sub-window will change according to the type of field that is currently selected in the 
Shopping Mall sub-window. Obviously, there will be a number of pre-defined window types, with 
parameters for naming the various selection controls. In the example below, a configuration that might be 
suitable for selecting a substance has been shown. A numeric data value may call up a configuration that 
allows the user to specify a range within which the value must fall. A text data value may call up a 
configuration that allows the user to specify a string that should contained within, begin or end the data 
value. Other data value types may have other configurations. 

In some instances it will be useful for the user to be able to ask for the number of data elements which 
match the constraints Imposed as a query segment is built. A button will be provided for this purpose 
within the Data Store sub-window. Under certain circumstances a fee may be imposed for retrieving this 
information. 

The Shopping Cart sub-window will hold the contents of each sut>-query. As with the Data Store 
suk>-window, there will be a button control that will retrieve the hit count, although in this instance, it is more 
likely that a fee will be imposed for retrieving this Information. Another button at the bottom of the 
sub-window will calculate the cost of retrieving the query (this will not be chargeable). A final set of buttons 
will retrieve the data elements associated with the query. The first button will retrieve the query in a visual 
format as a Web Page. The other button will retrieve the query as a result data set that can be stored 
locally on the user's workstation, as long as the DataCHEST client software has been purchased and 
installed. 
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Functional Organization 



The following diagram describes our preliminary view of how the various parts of the database will be 
organized. Obviously, until the choice of tools has been confirmed, the functional organization cannot be 
frozen. 

Figure 3: Functional Organization 
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Clients will access the DataCHEST server from their Internet Web Browser (1). From there, they will 
select which type of query they want to perform, either an interactive one displaying results on the screen, 
or a query which will return a result data set as an SQL table. They can then design their query 
interactively using a Wizard, or else choose to supply their own SQL statements and expressions. 

In each case, a Java or ActiveX applet will be downloaded to their browser (2) to execute the query wizard 
or receive the SQL statements. Interactive queries will be passed from the Web Server (3) to an HTML 
translator (4) that will transmit the query to the Data Warehouse (5). The Data Warehouse will process the 
query and retreive the result set. passing it back to the HTML translator where it will be formatted into an 
HTML page (6) that will be passed back to the Web Server and sent (7) to the Web Browser. 

SQL table queries will connect directly to a Database connection server (8) that will maintain and control 
the concurrent sessions, threads and access security, passing the quer> to the Data Warehouse (5) and 
returning the result set to the Browser Applet (2). The Browser Applet will in turn store the result set in a 
local database (9). 

Periodically, as the Information Providers update their data, their data sets (A) will be imported into the 
master database in the Data Warehouse (5). A suitable tool (B) will be used to map the source data onto 
the master target, so that the process can be repeated each time, making only those changes to the 
established mappings that are required to reflect updates to the structure of the source databases from the 
Information Providers. 

The possibility of collaborating with remote databases (C). maintained by the Information Providers at their 
location should also be available, although it is doubtful if this level of complexity can be achieved at the 
outset. To do this would require a Foreign Database Linkage module (D). 
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Abstract 

This document describes the data model employed by DataCHEST for its Data Repository 

Design Goals 

The DataCHEST Data Repository is a data store for data origirating from a wide variety of sources. As such, 
it must provide a simple and cohesive structure in y/nich to catalogue and index the transfomied source data. 
The DataCHEST Data Repository is intended to be a reference source rather than a transactional database. 
As such, the data must be easily retrievable by a wide variety of criteria, instead of being organised acconiing 
to a predetermined logic. 

Data Types 

The Integration of each nev/ database Into the Data Repository has the potential to justify the need for the 
creation of a new data type in order to manage the data to be stored. Hopefully, the process of transfonning 
and importing data from the initial set of databases that have been evaluated will result in the definition of 
sufficient variety of data types to handle the transfomiation of the majority of databases that will be 
encountered during the first year of deployment 

Each of the data types contemplated so far are listed and discussed below. 

Enfielded, relational data elements 

The primary characteristic of this type of data that will be imported into the DataCHEST Data Repository is 
that it is substance-related. The data elements will always relate to a particular substance, for example, 
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• the boiling point of Ethanol, or 

• the fact that Freon belongs to the family of chlorinated hydrocarbons, or 

• that Freon Is listed as an environmental hazard for atmospheric release under the US Federal Clean 
Air Act. 

Enfieided. relational data is by far the most complicated type of data that will have to be transformed into a 
unified structure. The primary reason for this is that the source data already exists within its own structure, 
and is already linked to other elements by the relationships within that structure. 

The key to integrating data from these various dissimilar structures is to create a database schema in which 
the data element definitions are themselves data. This strategy ensures that the data element definitions are 
thus extensible and flexible enough to accommodate each new type of data element and each new structure 
as new data sources are imported into the Data Repository. 

Data in this category will fall into four major types: Boolean. Numeric. Text Strings and Large Objects. 
Boolean data 

Boolean data represents an either-or binary value. In the context of the DataCIHEST Data Repository, the 
data represented will have meanings like True/False, Yes/No. and Presence on or Absence from a list. 

Numeric data 

Numeric data will be represented as a floating-point value. There does not seem to be enough differentiation 
to warrant the segregation of integer values in a separate table. 

Text strings 

Text strings will be limited to a length of 255 characters in order to remain compatible with all of the common 
ODBC-compatible database engines available on the market today. Textual database elements that exceed 
this size will be stored as Large Objects, even though their indexing process might require some time- 
consuming conversions or the use of a full-text search engine. 

Text Strings will also be used to store hyperlink data elements, or pointers to objects stored elsewhere. 
Large objects 

Large objects will be used to store data that, although it is relational, cannot be stored in the three 
conventional data types discussed above. Some of the data types that will fall into this category are 
discussed below. 

It has not yet been decided whether to store all large objects in the Data Repository within the Database 
Engine itself, or else as files in a file system. The indexing and linking processes should be able to support 
either option. 

Object data elements 

It will be difficult, in certain instances, to convert the data into common forms that will permit direct comparison 
of values between providers. Some providers store numeric data in numeric form, while other have chosen to 
store numeric data as text in order to handle exceptions or unusual values. DataCHEST must endeavour, in 
addition to publishing the data in the providers' original format, to do Its utmost to provide uniform data 
formats wherever possible. 

Thus, for simple numeric attributes, such as boiling points, there will not be one simple numeric element type 
that can completely describe the possible values. Rather, there will have to t>e a series of elements, 
organised into an object structure, which completely describe the attribute. A possible example of an object 
structure for describing boiling points is provided t>ek}w. 

1 . Numeric value for the temperature 

2. Unit of measure Token (whk^h pc^nts to a multi-llngual Ust of values. e.g. *C. 'F, X etc.) 

(Or else convert all values to a common unit of measure arid re-convert them to a specific unit of measure upon retrieval) 

3. State Change Token (which points to a multHlngual list of values, e.g. Bolls. Decomposes. Explodes, etc.) 

4. Numeric value for the pressure at which the measurement was taken 
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5. Unit of measure Token (points to a mulU-lingual list of values, e.g. mm/Hg. psi. KG/cm*. atmospheres, etc.) 

While this may seem sufficient to cover the various easily imaginable cases, there will inevitably be 
exceptions to even these broad specifications. For that purpose, we have included with every data element, 
the option of Including a remark or comment in which the circumstances of the particular exception can be 
explained. 

Unstructured textual data elements 

For the most part, unstructured textual data elements are not substance-related, and consist of texts of law. 
regulations, scientific and medical research studies, news reports and other documents. These data 
elements will be indexed using a full-text indexing search engine. 

On the other hand, a substantial number of textual data elements, such as export certificates, patents, and 
MSDSs, are substance-related. These documents can be indexed in two ways: as full-text and as hyperlink 
values in a relational database. 

Semi-structured textual data elements 

r^SDSs are also representative of semi-structured data elements in that they can be divided into meaningful 
zones or areas of text that have a certain significance. For instance, in some MSDS formats. Section 2 might 
contain Infonnatlon about the ingredients of mixtures, as well as physical and chemical properties, while 
Section 4 might contain First Aid Procedures. By pre-processing these documents prior to storage in the 
repository. DataCHEST can include zone markers that will permit users to search within a specific zone inside 
each Msbs. This capability can add valuable infomiation to the context of a query. 

Spatial data elements 

Although this type of information may not be useful in the initial releases of the DataCHEST Data Repository, 
later releases that include this data type will provide valuable functionality to the users. Using spatial 
information, users will not only be able to select regional data elements by clicking on maps, but they will also 
be able to do things like build diagrams of 2D and 3D representations of molecules or molecule fragments 
and search for matches across the substance database. Statistical or transactional data could also be 
summarised by region and displayed on a map using a coloured intensity-scale. 

Image data elements 

Various types of elements data fall into this category, such as pictograms, scanned documents, satellite 
photographs and X-ray images. For the most part, these data elements will have to be Indexed using 
keywords, added either manually, or derived from the file name and/or source directory. One notable 
exception to this rule is the case of scanned documents that could be processed through an OCR (Optical 
Character Recognition) engine and full-text indexed. 

Other data element types 

Despite the relatively exhaustive list of standard data element types described above, there are still other data 
types that will have to be handled In the DataCHEST Data Repository. Audio and video clips, Multimedia 
objects, statistical and transactional data will all find their way into the DataCHEST melting pot following the 
initial release of the application. 

Multi-lingual data 

The textual data elements and documents contained in the DataCHEST Data Repository originate from ail 
over the world, and are written in different languages. Both the Data Repository and the Query Wizard must 
be able to search across these data elements, regardless of language, and must also be able to manage and 
display the data. 

The schema of the Data Repository must thus be structured so that all of the text elements can be identified 
by language, and furthermore, so that they can be translated and stored In more than one language. 

Schema 
Schema Diagrams 
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The Schema of the Data Repository is diagrammed In the two figures below. The first figure includes the 
schema with the language relationships displayed, and demonstrates the level of complexity that is added by 
the multi-lingual nature of the data elements. 

Figure 1 - DataCHEST Data Repository Schema, with language relationships displayed 




The second figure, below, diagrams the same ^hema without displaying the language relationships, making 
the structure much more easily discemible. 
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Figure 2 - DataCHEST Data Repository Schema, with language relationships hidden 



' Ficlottonships 
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Schema Description 

The schema is centered on the Data Elements table, which uniquely Identifies, but does not contain, each 
data element. Data elements are identified by a Data Element Identifier (DEID) and are stored in separate 
tables according to their data type, and where applicable, the language in which they are written. This 
enables the repository to manage data that exists In or has been translated Into multiple languages, for 
example. First Aid Procedures. MSDS's or TREMCards. 

Each data element is uniquely identified by three criteria. A Unique Substance Identifier (USID) relates the 
data element to a specific substance, a Field Token relates to a specific Field Description which defines the 
data element, and an Authority/Version Tolcen specifies the source of the data element. 

Data elements can also reference individual Remarks contained in their own table. These are comments that 
would apply to the individual data element, noting an exception to the general rule for the definition of the data 
element, for example, that a boiling point for carbon dioxide canncst be obtained at standard pressures 
because the substance evaporates directly from its solid state. 

The Substance table uniquely Identifies each substance In the Data Repository. Because a substance can 
have different properties depending upon its Physical State or who the Manufacturer is. there could 
frequently be more than one record in the database for a given chemical compound. To facilitate searching 
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and retrieval, attributes that uniquely identify a substance will be contained in the Substance table (the ones 
shown are for example only, and more will be added). 

The Substance Names table provides an important tool to facilitate searching for substances. Once again, 
due to multi-lingual considerations, the names are contained in a separate table. It is extremely common for a 
substance to have multiple names, and this characteristic is handled by referring each name to an entry in the 
Synonym Types table. 

Substance Names are often vague, referring at times to a unique substance and at other tmes, referring to a 
class of substances. By allowing duplicate Substance Names, the structure will provide for this condition. If 
the duplication of names proves to be unworkable, the relationship between Substance and Substance 
Names can be modified to implement a many-to-many relationship. 

A Composition table provides a means by which it is possible to represent mixtures by listing each ingredient 
(which itself must exist in the Substance Table) together with its concentration. Note that the data will simply 
be imported from the Information Provider. No attempt will be made to ensure that the total ingredient 
concentration wilt add up to 100%. Concentrations can be specified In any desired unit of measure, as 
referenced by the Units of Measure Table. 

A Field Descriptions table identifies the various fields that have been defined for substance-related data. To 
handle multi-lingual Field Names, these values are stored in a separate table. Each Field Description is 
associated with a unique Data Type, which is related to the data table in which the data is stored. For clarity, 
the Data Model is shown with four major Data Types (Boolean, Numeric. Text, and Unstructured), but 
additional types are also contemplated. Examples of these would be numeric ranges, hyperlinks, and other 
types of data. 

The Field Grouping table organises the Field Descriptions into the multi-level tree-structure for easier 
searching and selection. 

An AuthorityA/ersion Table identifies the reference source (Authority Names) of the data as well as the 
Information Provider that supplied the data, together with the version and date of last update of their source 
database. 

A Languages Table lists the various supported languages and is keyed to those tables that are able to hold 
multilingual data. 
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Abstract 

This document sets out the design specifications for the development end implementation of a Knowledge 
Management and Distribution System (KMDS) to facilitate the importing of data to the DataCHEST data 
repository at the back end and the retrieval of the data from the repository at the front end. The system will 
also manage the client billing, the supplier royalty payments and the transaction processing and analysis. 
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Oenormalized Table Creation cican.doc 1 



01. DATA gLBHBWTS Table Cgeatioa Script 



CREATE TABLE OL_^DATA_^ELEMENTS 
(ELEM_IO NUMBERdO) NOT NULL 
.SFID NUMBER(7) NOT NULL 
.VERSION_TOKEN NUMBER NOT NULL 
.FIELD_^TOKEN NUMBER(7) NOT NULL 
.DEID NUMBER(12) NOT NULL 
.PARENT_DEIO NUMBER(12) NOT NULL 
.FIELD_COST NUMBER(8.2) NOT NULL 
.FIELD_PRICE NUMBER(8.2) NOT NULL 
.EFFEaiVE DATE 
.EXPIRY DATE 

, INTEGRATED DATE NOT NULL 

.RMK^TOKEN NUMBERCIZ) 

.MFRID NUMBER(12) NOT NULL 

.PH5TJ0KEN NUMBER(12) NOT NULL 

.U5ID NUMBER(7) NOT NULL 

.MFR NAME VARCHAR2(255) NOT NULL 

.PHYSICAL_STATE VARCHAR2(255) NOT NULL 

.SUSVERSION TOKEN NUMBER NOT NULL 

.CAS VARCHAR2(20) NOT NULL 

.5RCDB_ID VARCHAR2(255) NOT NULL 

.SDBID^DESCR VARCHAR2(4000) 

.DC DATE DATE NOT NULL 

.DTATYP_TOKEN NUMBERO) NOT NULL 

.PV^TOKEN NUMBER(12) NOT NULL 

.AV^TOKEN NUMB£R(12) NOT NULL 

.AUTH_TOKEN NUMBER(12) NOT NULL 

.AV$DC_VERSION VARCHAR2(i00) NOT NULL 

.AVSEFFECTIVE DATE 

.AVSEXPIRY DATE 

.AVSCREATED DATE NOT NULL 

.AUTH^NAME VARCHAR2(255) NOT NULL 

.PV$DC_VERSION VARCHAR2(100) NOT NULL 

.PROVJOKEN NUMBER(12) NOT NULL 

.PVSEFFECTIVE DATE 

.PVSEXPIRY DATE 

.PVSCREATED DATE NOT NULL 

.PROV_NAME VARCHAR2(255) NOT NULL 

.PROV_PRODUCT_NAME VARCHAR2(255) NOT NULL 

.LTID NUMBER(7) 

.FD$ICONS_TOKEN NUMBER(5) 

.FD$DTATYP_TOKEN NUMBERO) NOT NULL 

) 



Priaiary Key 



ALTER TABLE OL DATA_ELEMENTS 
ADD CONSTRAINT OL DATAJLEMENT5_PK PRIMARY KEY 
(ELEN ID) ^ 



Note : The $ aign in a field name is use to separate the table ALIAS from the field 
name used in the creation SQL acatement Cor a field name Chat is ccmmon in two tables 
(i.e. PROVIDER_V£R5ION « PV » EXPIRY date vs AUTHOR ITy_VBRS ION « AN » EXPIRY date) 



028 



BNSDOCID: <WO 01S96iaA2_L> 



wo 01/59613 



PCT/IBOi/00369 



Oenormalized l abLe Creation clean.doc 2 



Data loading logic 

The OL (On Line) Daca Element table ia populated from the MMN database using the 
£ollowin9 tables {Note s Not all £±eldB oi a given table may be uaod) : 



Table Names 
AUTHORITY_^NAWIE 

AUTH VERSIONS 



DATA ELEMENTS 



ELEMS 

FiELD_DESCRIPTIONS 

INFO^VERSIONS 

MANUFACTURERS 

PHYSICAL_STATES 

PROVIDERS 



Field Names 

AUTH TOKEN 
AUTH~NAME 

AV^TOKEN 
AUTH^TOKEN 
DC^VERSION 
EFFECTIVE 
EXPIRY 
CREATED 

ELEMJD 
SFID 

version^token 
field token 
deid"" 

parent^deid 
field_cost 
field^price 
effective 

EXPIRY 

integrated 
rmk_token 

deid 

DTATYP^TOKEN 

field^token 
ltid 

icons_token 
dtatyp_token 

VERSION_TOKEN 
PV TOKEN 

av"token 

MFRID 
MFR^NAME 

phst_token 
physical^state 

PROV_TOKEN 

PR0V_NAME 

ADDRESS 

PROV.PRODUCT_NAME 
CONTACT 



Reld Descriptors 



NOT NULL 
NOT NULL 

NOT NULL 
NOT NULL 
NOT NULL 
NOT NULL 

NOT NULL 



NOT 
NOT 
NOT 
NOT 
NOT 
NOT 
NOT 
NOT 



NULL 
NULL 
NULL 
NULL 
NULL 
NULL 
NULL 
NULL 



NOT NULL 



NOT NULL 
NOT NULL 

NOT NULL 
NOT NULL 

NOT NULL 

NOT NULL 
NOT NULL 
NOT NULL 

NOT NULL 
NOT NULL 

NOT NULL 
NOT NULL 

NOT NULL 
NOT NULL 



NUMBER(12) 
VARCHAR2(2551 

NUMBER(12) 

NUMBER(12) 

VARCHAR2(1001 

DATE 

DATE 

DATE 

NUMBER(12) 

NUMBER(7) 

NUMBER 

NUMBER(7) 

NUMBER(12) 

NUMBERS 12} 

NUMBER(8,2) 

NUMBER(8,2) 

DATE 

DATE 

DATE 

NUMBER(12) 

NUMBER! 12} 
NUMBERO) 

NUMBER(7) 
NUMBER(7) 
NUMBER(5) 
NUMBERO) 

NUMBER 
NUMBER(12) 
NUMBERO 2) 

NUMBER(12) 
VARCHAR2(2&5) 

NUMBERn2) 
VARCHAR2(255) 

NUMBER<12) 

VARCHAR2(255| 

VARCHAR2{255) 

VARCHAR2(255I 

VARCHAR2(255) 
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PROVIDER VERSIONS 



SUBSTANCES 



SUBST FORMS 



SUBST NAMES 



PV TOKEN 


NOT NULL 


NUMBER(12) 


DCl^VERSION 


NOT NULL 


VARCHAR2riOO) 


PROV TOKEN 


NOT NULL 


NUMBERd 2) 


EFFECTIVE 




DATE 


EXPIRY 




DATE 


CREATED 


NOT NULL 


DATE 


USID 


NOT NULL 


NUMBER!?) 


VERSION TOKEN 


NOT NULL 


NUMBER 


CAS 


NOT NULL 


. VARCHAR2(20) 


SRCDBJD 


NOT NULL 


VARCHAR2(255) 


SDBID PESCR 




VARCHAR2(4000) 


DC.DATE 


NOT NULL 


DATE 


SFID 


NOT NULL 


NUMBER!?) 


MFRIO 


NOT NULL 


NUMBER! 12) 


PHST TOKEN 


NOT NULL 


NUMBERd 2) 


USID 


NOT NULL 


NUMBER!?) 


LANG TOKEN 


NOT NULL 


NUMBERO) 


NAME TOKEN 


NOT NULL 


NUMBER!9) 


SORT NAME 


NOT NULL 


VARCHAR2!255) 


NAME DESCRIPTION 




VARCHAR2(4000) 


DATE CREATED 


NOT NULL 


DATE 


VERSION TOKEN 


NOT NULL 


NUMBER 


SYN TOKEN 


NOT NULL 


NUMBER!?) 
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Creation SQL etatement 

. fh^ data the following SQL statement must be executed using a tursor 
'^^""l^Vorll"^^^^^^ - -oid Rollbac. Segn^^nta to cras^-) 



SELECT OE.ELEM^IO. 
DE.SFID. 

DE. VERS ION JOKEN. 
OE.FIELDJOKEN. 
DE.OEID. 
DE.PARENT^DEID. 
DE. FIELD COST. 
DE.FIELO_PRICE. 
DE. EFFECTIVE. 
DE. EXPIRY. 
DE. INTEGRATED. 
DE.RMK JOKEN. 
SF.MFRIO. 
5F.PHST_T0KEN. 
SF.USID. 
MA.MFR NAME. 
PS.PHYSICALJTATE. 
SU. VERSION JOKEN SU$ VERS I ON JOKEN. 
SU.CAS. 
SU.SRCOBJD. 
SU.SDBID^DESCR. 
SU-DC DATE. 
EL. DTATYP JOKEN. 
IV.PVJOKEN. 
IV.AVJOKEN. 
AV.AUTH JOKEN. 

AV OCJERSION AV$DC VERSION. 
AV. EFFECTIVE AVSEFFECTIVE. 
AV. EXPIRY AVSEXPIRY. 
AV. CREATED AVSCREATED. 
AN.AUTH NAME. 

PV.XJERSION PVIDC_VERSION. 

PV PROV JOKEN. 
PV. EFFECTIVE PVSEFFECTIVE. 
PV. EXPIRY PVSEXPIRY. 
PV. CREATED PV$CREATED. 
PR. PROV NAME. 
PR.PROVJRODUCTJAME. 

FD.LTID. 

FD ICONS TOKEN FD$ I CONS JOKEN. 
FD. DTATYP JOKEN FOIDTATYP JOKEN 
FROM SUBSTJORMS SF. 
MANUFACTURERS MA. 
PHYSICAL^STATES PS. 
SUBSTANCES SU. 
ELEMS EL. 
INFOJERSIONS IV. 
AUTHJERSIONS AV. 
AUTHORITY JAME AN. 
PROVIDER_VERSIONS PV. 
PROVIDERS PR. 
FIELD DESCRIPTIONS FD. 
DATA ELEMENTS DE 
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WHERE DE.SFID - SF.SFID 
M'D SF.MFRID = MA.MFRID 
AND SF.PHST TOKEN = PS.PHST TOKEN 
AND SF.USID"- SU.USIO 
AND DE.OEID = EL.OEIO 
AND DE.VER5I0N_T0KEN - IV. VERSIONJOKEN 
AND IV.AVJOKEN « AV.AV_TOKEN 
AND AV.AUTH^TOKEN = AN . AUTH__TOKEN 
AND IV.PV TOKEN = PV.PV^TOKEN 
AND DE.F1ELD_T0KEN - FO.FIELD^TOKEN 
AND PV.PROV TOKEN - PR . PROV^TOKEN : 



Quality Check 

To ensure that Che On Line Data Element table has been populated correctly, the 
fallowing rules or guidelines muse be verified : 

1. The number of records in the DATA^ELEMEtrrS table must be the same as the target 
0£j_DATA_ELEMENTS table. 

2. If the number of record created Is less (re£. rule #1) a Foreign Key Integrity 
Constraint has been violated when generating the data (i.e. a foreign exist in one 
table but not in the referenced table) 

3. If Che number ot record equals 0, one (or more) of the table is empty (Same 
violation aa rule #3 but easier c6 identify}.. 

4. If the number of record in the DATA_ELEHEtlTS is less than the newly generated 
OLi_DATA_BLEMENTS tables, it means that a Primary Key constraint as been violated 
(or disabled) . 
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Abstract 

This document sets out the design specifications for the development and implementation of a 
Knowledge Management and Distribution System (KMDS) to facilitate the importing of data to the 
DataCHEST data repository at the back end and the retrieval of the data from the repository at 
the front end. The system will also manage the client billing, the supplier royalty payments and 
the transaction processing and analysis. 



Minimum set of capabilities required for the application 

10 The minimum set of operational capabilities required for the DataCHEST IMADS is listed below. 

1. The IMADS must be able to store, index, and retrieve all types of data formats. These can 
range from simple Boolean (True/False). Numeric (Integer or Floating point), or Text values, 
to a complex array of values (Objects], bit-mapped images, formatted documents or other 
data types. 

15 2. The IMADS must be able to process multiple languages and character sets. DataCHEST 

source databases will originate from countries throughout the world, and must be accessible 
to any client who requires the data. DataCHEST clients will understand many different 
languages and must be able to use the IMADS In their native language if their business case 
warrants translation of the user interface. 

20 3. The size of the database and the number of data elements it will contain will be extremely 

large. 

4. The IMADS must be capat>le of being mirrored to multiple remote sites both in toto and in 

subsets. 

5. The IMADS client must have an open architecture and be able to be executed from al) of the 
25 popular platforms, including at a minimum, MsWindows (3.1. '95. '98, 'NT). MacOS (System 7 

and higher), and Unix (with a graphics terminal). Support for other platforms, for example 
IBM AS/400, would be desirable. Of course it might be impractical to assume that the user 
Interface would function properly on a character-based terminal. 

6. The IMADS server must also have an open architecture. Its modular nature provides the 
30 flexibility for certain application segments to run on different platforms. While key 

components, like the transaction server or the Web server might be limited to running on Unix 
or Windows NT Server, the Data Repository should be capable of being run in a variety of 
environments 

7. The data transformation tool must be able to import data from all popular database platforms, 
35 flat file fomnats, and popular spreadsheet and word-processor formats. 

8. The Query building process must be Intuitive, convenient and simple to use. 

9. Query retrieval must offer comfortable response times. The user interface should be able to 
analyse user queries and warn the user if his request is unusually complex or large, or if it will 
take a long time to execute. Ideally, such requests should be able to t>e deferred and run 

40 remotely, with large result data sets being returned off-line, either by wire or by hard media, 

such as CD/ROM or tape. 
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Account 
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Definition 

The following is how a definition should appear in documents. 

\ A special kiifd of document thdt pfp}^ides 6^^ for shaping apnal document 



Defined terms should be Indexed. 

The definition should appear in a glossary at the end of a document. 
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Glossary : , 

To fill in the glossary, do the following: 

1 . create a bookmark including the defined term and its definition. 

2. in the Glossary section, insert a cross-reference to the bookmark. 

Tempfate 

A special kind of document that provides basic tools for shaping a final document. 
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Abstract 



This document describes the various types of entries contained within the DataCHEST Index 
Tree. 



5 Field Topics and the DataCHEST Index Tree 

Due to the extremely large numt>er of fields In the DataCHEST repository, we have created a 
^ representation that we call a "poly-hierarchical tree-structured index", or Index Tree in order to 

organize the different data fields stored within the repository. (For a technical explanation of the 
DataCHEST Index Tree, please refer to the document entitled DataCHEST Query Wizard.) 

10 The screen-capture below is a representation of a portion of the DataCHEST Index Tree. 
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-DAt«chesl 

-Chemicftk & siitstances ; TieU Topic 
■Health: Field Topic 

TSCA - Seel 5le) • New Chemical EKposue Uimis (NC£Ls) : Ffee-fofma» Tejd 

NIOSH • Health Slandards - Pesbrides ; Free-forrnat Text 

- TSCA • Seel 8 • Health and Safety Reportir^ list ; Fiee-lamat Teid 

i Calif omia ♦ Obectors U$i of Hazaidous Subsiances (8 CCR 339) ; Fiee^onnat Text 

i - EPA • Cafdnogcn Hazard RarkinQ foi RQ AdjurtmenJ ; Look-up labia 

^ TSCA • Sed 8 • Health and Safety Repoitiiv List ; Slnjclued Ofaiecl 

- Toxicotogy : Field Topic 
B- — Exposuie Limits ; Fidd T opic 

{^—.Workplace Sefety Standards & Regulations ; FieU Topic 
E- Skin Sensitivity : Field Topic 
IS — Chemicai Test Rules and Methods : Field Topic 



'lent : Field Topic 



m- 

li^ .. - Safely (Hazards) : Field Topic 

^ Trantportaiion ; Field Topic 

f^ ' RecriontrFridTopie 

— North Amenca ; Field Topic 
^ — Asia^adTc ; FieU Topic 
1^ — Latin Ametica (South, CentiaQ : Field Topic 
^ — Europe : Ftetd Topic 

International ; FieJd T cpc 
(S — Middte East : Field Topic 
ffl- • ReoLdaioiy Lists : Ftdd Topic 
^- ■ Undassified Fields : Field Topic . 



;| Window 



Figure 1 : Screen Capture of a section of the DataCHEST Index Tree. 



The entries in the Index Tree can be categorized into three main groups: 
1 . Field Topics, 
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2. Data Fields, and 

3. Structured Objects. 



Field Topics 



10 



15 



Field Topics are used to classify and organize the entries within the tree. An appropriate analogy 
would be to consider them Hke sub-directories on a computer's hard drive, or folders in a filing 
cabinet. In Figure 1 above, under the DataCHEST root, the first five Field Topics represent the 
CHEST acronym : Chemicals, Health, Environment, Safety, and Transportation. 

Field Topics can themselves be contained within other field topics. In Figure 1 above, under the 
Field Topic "Regions", the principal economic continental areas are listed, together with a heading 
for International, covering data fields that relate across political or economic boundaries. Figure 
2. below, displays the same freld topic. "Regions", but it is now expanded to show the sub-topic 
"Europe", with two members, "European Union" and "United Kingdom", Below the topic 
"European Union", some of the member countries are listed. If one of the member countries were 
to be expanded, the Data Fields classified under that country would appear, together with any 
further field topics that might apply, such as a state, province, region or city. 



. TcmpRcad 




- Dalaches) 

- Chemicals tt substances : Field Topic 
I — Heafth;Fidd Topic 

i^ - Envifonmeni : Field Topic 

I — Safety (Hazards) : Field Topic 
— Transportation ; Field Topic 
B- - Regions ; Field Topic 



Asia-Padfic ; Field Topic 

'■ ■ Latin Antenca (South. CsntiaQ : Field Topic 
E ~ - Europe ; Field Topic 

" European Union ; Field Topic 
h - - Germary ; Field Topic 
I — The Netherlands : Field Topic 
I 'r EUFedeiai: Field Topic 

\ ] Austria ; Field Topic 

I Denrnaik; Field Topic 

; I L.„.. Prance ; Field Topic 
; j |— - Norway ; Field Topic 
j i — Sweden : Field Topic 

^ - United ICingdom ; Field Tope 

- International ; Field Tope 

Middle East , Field Topic 

flequiatory U\^\t ; fieid ToPiC 
•Unctassified Fields ; Field Tope 



1 




Figure 2 : DataCHEST Index Tree showing expanded Field Topics under the "Regions" category. 
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Data Fields 



Data Fields represent the actual data stored within the repository. In Figure 1 above, under the 
topic DataCHEST- Health, there are several examples of Data Fields, such as TSCA Sect. 5(e) 
- New Chennical Exposure Limits (NCELs)*, 'NIOSH - Health Standards - Pesticides", and "EPA 
5 - Carcinogen Hazard Ranking for RQ Adjustmenr. 

Data Fields are extremely varied. Although the process of database integration has only just 
begun, we have already identiHed over a dozen different data types, and it is expected that more 
will be added as additional databases are incorporated into our repository. The Data Types 
currently defined are as follows : 



10 



Boolean 
Formatted Text 
Free-format Text 
Hyperlinks 
Image 
Integer 

Large Text (Memo) 
LookHjp Table 
Numeric 

Substance Composition 
Other 
Field Topic 
Structured Object 



Simple binary values (True/False, Yes/No, Present/Not-Present) 
Text value formatted according to a set of rules 
Text value with no structure 

Link to an object outside of the substance-related database 

A graphic image or file. e.g. bit-map, JPEG. GIF. TIFF. etc. 

A whole number (no decimal places) 

Text value whose length exceeds 255 characters 

An index into a table of valid/possible values 

A numeric value associated with a defined unit of measure 

A special construct for specifying the composition of a substance 

A catch-all for otherwise unspecified data types 



For simplicity of design. Field Topics and Structured Objects are defined in the same Field 
Names and Field Description tables as the other data types. They can simply be considered as 
data types that have no data attached. 
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Structured Objects 

Stnictured objects were devised in order to group together those data elements that are related to 
each other by more than their related Substance. For example in Figure 3 below, ^'Toxicity Test" 
results are characterized by a number of Individual data elements such as "Test Subject 
Animal/Species', "Exposure Period^ and "Test Result Description'' that, individually, have very 
little significance. Taken as a *seM of data, however, these data elements comprise a 
representation of the test results. 



TempRead 




Substance Composition ; Substance Composition 

• Phywcd Piopcrtics ; Field T Dpic 

• T c»dcoloflj» ; Field Topic 

— NIOSH - Health Standards • Health Effects and Piecautions : Free-format Text 
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i - - EPA- ATSDR Priorilv Usl ; Fcee foimat Tod 
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i ICnois - PROPOSED Endocrine Disrupter List ; Free-format Text 

Canada - Accelerated Reduction/Elminalion of Toxics |ARET) ; FrBc-formal Text 

i Target Oigans ; Re!d Topic 
1^ ■ [EcolTowcily ; Ftdd Topic 

j 1 - Ecotcwdcily - Aquatic Towdty Data ; Ffee-format Text 

O ■ __„„„ _.. ._. 

- -To»cUy Tests: Type ; Look-up Table 

- Towci^ TestK Subject Species (abbreviatedl ; Free-format Text 

- Tonci^ Tests; Subject Animat/S pedes (Mouse. Rat. Rabbit, etc) ; Look-up Table 
—Toxicity Tests: Subject Species Ful Latin oi English Sctentllc Name ; Fiee format Text 

- Toxicily Tests: Exposure route .Look-up Table 

- - Toxicity Tests: Concentration vakie ; Numeric 
-Toxicity Tests: Exposure period ; Numeric 

-Toxicity Tests: Method or Eflecl (LC50. LD50. Death, etc.) ; Free format Text 
•Toxicity Tests; Method or Effect Type (LC50. LD50. etc.) ; Look-up Table 
-Toxicity Tests: Method or Effect Description ; Large TexIfMenw) 

- Toxicity Tests: Result Desciption ; Large Text (Memo) 
--ToMcity Tests: Reference Year : Integer 




* I Window 



Figure 3 : DataCHEST Index Tree.showing a Structured Object. 
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Title: DataCHEST Database Mapping Utility Design Specifications 



Abstract 



This document summarizes the design specifications for the program that will be used to update and maintain 
the tables that comprise the DataCHEST Index Tree and Field Mapping structures. This program is. intended 
to be used by the Source Database Mapping Team. This team is made up of Chemical, Health & Safety. 
Environmental Science. Transportation. Regulatory and/or other Industry specialists, who are not necessarily 
welt versed In database architecture and maintenance. 



The utility is required to facilitate the manipulation and update of the Index Tree Data Structure. The Index 
Tree is the central structure used to access information within the DataCHEST Repository. Because of the 
technical nature of the information stored within the repository, it is necessary for the Index Tree to be 
maintained by personnel with a strong background in Chemistry. Toxicology. Environmental Sciences, 
Transportation requirements for hazardous materials, and other regulatory fields. 

Such [personnel are not often skilled in database technology and administration. They must therefore be 
provided with a simple, intuitive utility that will aDow them to add fields, structured objects and field topics to 
the Index Tree, and to create and/or modify subject groupings and categories, and to classify fields within 
these groups. 

The process of maintaining the Index Tree involves the analysis and study of the source database, the 
mapping of the source data to appropriate fields within the DataCHEST Repository, creating them where 
necessary, and specifying data transformations to extract unitary data from composite fields within the source 
database. A rich library of appropriate topic Icons will be available to the mapping team so that they may 
associate the icons with the fields and topics that are added to the Index Tree, and improve the ease of use of 
the user interface. 

The DataCHEST Query Wizard must provide an on-line documentation and audit of all of the information 
available in the DataCHEST Repository. This documentation is a natural by-product of the mapping process. 
Therefore, the Index Tree Maintenance Utility should also update and maintain the Field_Mapping Table in 
which the documentation data is stored. 

Finally, as part of the process of maintaining and creating field entries in the DataCHEST Index Tree, the 
mapping team must create and maintain entries in the Look-up Table, which defines the set of possible valid 
values for the contents of data fields assigned to this data type. This function must also be incorporated into 
the 
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Table Structures used and updated bv the i\/laDDina Utility 



2 Relationships 




Figure 1 : Table Structures within the DataCHEST Repository Schema involved in the Index Tree 
Maintenance Utility 
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Title: DataCHEST Database Mapping Utility Design Specifications 



Program Flow Analysis 
Modules: 

f . Look-up Table 

2. Look-up Table Entries 

3. DataTypes 

4. Icons 

5. Field Definition 

(Field.Names. Field_Desciiptions) 

6. Field Grouping 



7. Providers 

8. Field Mapping 
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Abstract 

This document summarises the technology and applications that have been considered for use with the 
DataCHEST Query Wizard. 

Query Technology 



structured Queries 



Unstructured Queries 



Structured 
Data 



Bit-mapped 

Index 
Database 



Unstructured 
Data 




Query Types and Data Types 

Most database queries can be divided into two types. Structured Queries and Unstructured Queries. Structured 
queries are those that have been anticipated and planned for by the database designer, for example, querying 
accounts receivable information by asking for the customer by Customer ID number or name. The database designer 
plans for these queries by indexing the database on the query criteria in order to speed up the retrieval of the relevant 
information. The usual technology employed to support structured queries is a relational database, such as Grade. 
Sybase. Informix. MsSQL, Access, and others. 
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Unstructured queries, on the other hand, are unanticipated ones that do not foilow predefined procedures, and 
whose retrieval process may entail lengthy searches through every Item in the database. Typical examples of these 
are queries on historical sales data that attempt to establish trends according to geographical location, demographic 
criteria, or other elements that* because of space penalties imposed by the relational databases commonly used to 
store these types of transaction logs, are not already Indexed for speedy retrieval. 

To better manipulate unstructured queries, a new type of database indexing technique has emerged in the 
marketplace that uses a bit-mapped index structure. Three examples that we have reviewed are Oracle/8, 
SyBase/lQ. and Nucleus. 

Just like queries, data can be characterized Into two types. Structured data and unstructured data. Structured 
data is organized into individual elements, and cross-referenced by the different values that are related to them. 
Structured data can also be categorized into different standard fomiats. which make it easier to manipulate the data 
elements. 

Unstructured data, on the other hand, is not organized Into an ordered and defined structure. This does not mean 
that It Is necessarily disorganized, just that It is not laid out in tabular form. The relationships between the various 
data elements are based more on context or proximity, rather than on ordered lists and categories. The most common 
type of unstructured data is a collection of documents 

In order to be able to retrieve meaning from unstructured data, yet another technique Is required. Commonly 
referred to as full-text indexing, it is based on creating a positional index of every word of each document in a 
collection. Anyone who is familiar with current Internet technology has undoubtedly used search engines, which are 
the most common examples of this technology. 

Each of these three indexing techniques. Relational, Bit-mapped, and Full-text, exist primarily as separate tools 
(with the possible exception of Oracle 6). Although the industry is moving towards integrating them within a single 
application, the current state of the tools does not yet permit a seamless interface within a single, off-the-shelf 
;application. The DataCHEST Query Wizard Incorporates the necessary intelligence to not only utilize the most 
'appropriate technique for each query, but also to mix requests across alt three techniques. 

For example, it would be possible to import a list of CAS numbers of substances from which the user would like to 
select only those whose MSDSs have the text "in case of contact with eyes, fiush iiberatiy with water' located in the 
First Aid section, whose boiling point is less than 40**C, and for which the reportable quantity threshold is greater than 
30,000 metric tonnes usage per year. 
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frhtJ DataCHEST Shopping iyjail) 
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The Multi-threaded Index Tree concept is the fundamental building block under the DataCHEST Query Wizard. It is a 
new and unique synthesis of two common Graphical User Interface programming controls: Index Tabs and the Tree 
control. 

The design goal for the DataCHEST Query Wizard is the creation of an intuittvep easy to use indexed catalogue of 
data element descriptions contained within the DataCHEST data repository. Since it is expected that there will be 
several thousand distinct data element descriptions and topic headings, the challenge that must be overcome consists 
of navigating across the tree in a simple fashion that follows the users' thought processes as they formulate their 
queries. 

The Tree control is familiar to everyone who uses a file/directory manager (e.g. Windows Explorer), and thus provides 
a convenient method to navigate across each level of the tree. The myriad of data element descriptions can be 
categorised into groups, with either a parent/child connotation, or a topic/element relationship. 

The multi-threaded aspect of the tree structure is used to enable the user to follow multiple selection paths to arrive at 
the same data element description. For example, as shown in the table below, the topic «Toxicology» can fall under 
the topic headings «Chemical Propertiesw and «Health» with equivalent justification. Thus we are able to enter 
multiple relationships into our grouping database in order to represent the topic/sub>> topic relationship. 



Rooi 

Chemicals 

Physical Properties 

Physical Properties - Molecular Formulas 

Physical Properties - Structural Foimulas 

Physical Properties - Molecular Weights 

Physical Properties - Waler Solubirities 

Physical Properttas - pH Values 
Toxicolcgy 

Carcinogens 

ACGIH 1997 - Notice of Interxled Changes - Carcinoger^s . 
United Kingdom - Carcinogens 
lARC - Group 1 (carcinogenic to humans) 
tARC - Group 2A (limited human data) 
lARC - Group 28 (sufTictent animal data) 
Target Organs 

NIOSH 1997 - Pocket Guide - Target Organs 

Health 

Toxiootogy 
Carcinogens 

ACGIH 1997 - Notice of Intended Changes - Carcinogens 

United Kingdom - Carcinogens 

I ARC - Group 1 (carcinogenic to humans) 

lARC - Group 2A (limited human data) 

lARC - Group 2B (sufficient animal data) 
Taget Organs 

NIOSH 1997 - Pocket Guide - Target Organs 
Exposure Limits 

. TSCA - Sect. 5(e) - New Chemical Exposure Limits (NCEI^) 
Australian Exposure Standards - Carcinogens 



Table 1 - Sample Index Tree showing repetitive Branch Segments under multiple Topics 
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Figure 3 - Sample Tab Control (taken from Microsoft Word) 

The tab control toolbar enables another aspect of multi-threading within the index tree. Each root topic can also be 
considered as a filter, and it should thus be possible to apply multiple filters to the selection process. As the users 
descend the tree, the contents of the tab control v^ndow change to indicate the remaining available filters (root 
topics). Users can refine their search by clicking on a tab, whk:h will cause the selected root topic to appear below the 
currently opened tree node. The users can then continue navigating down the tree within the filtered set of data 
element descriptions. 
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Data Element descriptions within the Index Tree 

Another critical aspect of the Index Tree is the specification of the individual data element descriptions. This is 
primarily a manual operation, one that will require great care and discipline. 

As DataCHEST is consolidating data from a wide variety of sources, there will inevitably be a fair bit of overiap 
between the data supplied by the different providers. Unfortunately, not all of the overlapping data will be present in 
compatible formats, and it will be difficult, in certain instances, to convert the data into common fomis that will permit 
direct comparison of values between providers. 

The Data Element descriptions will thus have to reflect the format in which the data is stored. For instance, in the 
case of the property node for boiling points, there could be several child nodes Indicating the format of the data as 
shown in the example below: 



Physical Properties 
Boiling Point 

— Text form, units specified 

Numeric form, degrees Celsius at std. pressure 

Mumeric form, deg . Kelvin at 1.25 atmospheres 

Object form, units and pressure specified 



State change (text token) 

Temperature (numeric) 
Unit of measure (text token) 
Pressure (numeric) 
Unit of measure (text token) 
— Remarks 
Melting Point ... 



Sample values 

100"C at 760mmHg 

100 

375 

Boils (vs. Decomposes, explodes...) 
100 

degrees Celsius 

760 

mm Hg 

over open flame (vs. Microwave oven) 



Selecting a parent node, such as Object form or Boilixig Point, would return all of the child values associated 
with it. 
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Node controls 



T ' -io '^"'?3*-:j" Store) 



Once a data element node has been selected, the user will have the possibility of either retrieving all of the data 
associated with that node, or creating a selection criterion for the selected data element. Much like the User Interface 
for Microsoft Windows^*' Explorer, as each node of the left>pane tree is selected, a custom control is displayed in the 
right windowpane. 

Common control panes would include selection boxes for numeric values and text searches, although many other 
custom controls will also be created for the DataCHEST Query Wizard application. 

A special Substance Selector control would be created to select substances according to 

■ name, whether a chemical name, brand name or other type of synonym, 

; • governing body registration number, such as CAS (Chemical Abstract Service). EINECS (European Inventory 

of Controlled Substances). RTECS (Registry of Toxic Effects of Chemical Substances) or other organisation, 

- manufacturer and manufacturer's product code number, 

■ presence of an ingredient, possibly within a concentration range, or, 

■ any other form of identifying characteristic. 

1 A means will also be provided whereby the user can upload previously saved or automatically generated lists of 
substance identifiers. 

We will provide for custom panes for building news group profiles and for selecting consultants. 

Another custom control pane that we envision for inclusion in subsequent releases of our application would entail a 
graphic selection tool in which the user could build a structural representation of a molecule or sub-structure, and 
i search for a complete or fragmented match. 

Additional control panes can be added as necessary. However, every attempt should be made to work within an 
established set of controls so that confusion and training requirements are kept to a minimum. 

As each property node is selected, the Query Wizard can display a count of the number of data elements for that 
particular node, for instance, if we select Boiling Point, the application should display the fact that the Repository 
} contains information on 5000 substances. Displaying this type of information is considered to be good public 
relations, and there should not be a charge for it. 

On the other hand, displaying a filtered result can often have a definitive value and should therefore be chargeable. It 
is one thing to publicise that our Database contains t>oiling points for 5000 substances, and quite another to qualify 
that only 300 of these fall below -33* C 
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Request Buffer 

(Tjie 'Sliopping Cart') 
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As each criterion is completed, it is added to the request buffer. Several criteria can be combined into a single 
request, and. using the Boolean operators AND, OR, NOT and XOR, the request can be as general or as specific as 
needed. The user should have the necessary tools to modify the criteria making up a request at his discretion. 
Pressing the « Query Cost» button at any time should perform the request without retrieving the data, and display the 
net cost to the user, after applying the appropriate discounts. We are considering providing this information at no 
charge to the user. 

Querying the number of hits for a query could actually be useful information, and it is thought that a small charge 
would apply when the «Query # of Elements^ button is pressed. 

Once the user has decided upon a query, he or she would proceed with the transaction by pressing the 
«Retrieve Data (Purchase)^ button. Alter confirming the request and accepting to pay the quoted amount, the 
request would be re-executed (hopefully from cache) and the result set returned to the client in the requested format 
(HTML or JDBC data set). The Client would then be able to view and manipulate the data at his discretion. 
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Abstract 

This document covers interim modifications to the presentation of the Results Page of the Query Application. 

Modifications required for the Results Page presentation 
Display Structured Objects as a sub-group witliin the result bloclc. 

Currently, all data nodes are displayed in somewhat random order in the result block (group-by). If there are two 'hits' 
for the same SFID and Version JToken, they are grouped within the same field cell and displayed in random order. 

This is fine for independent fields (children of Field Topics), but is disastrous for dependent fields (children of 
} Structured Objects). 

We must come up with a way around this problem on the Results Page. Ideally, a Structured Object Parent Field 
should be displayed exactly as an independent field, and its children should be displayed to Its right in separate rows. 
I have given an example below: 





Water 


Substance CAS number 


000-00-0 




DataCHEST Sampie:v1 :DalaCH&ST Sample Oata;v1 : 




DataCHEST.com Inc. 


Independent Field 
(e.g. Boiling Point) 


= 212»F 


= 100 


Structured Object 

(e.g. Arsenic Levels in drinking 

water} 


Dependant field (e.g. Arsenic concentration %) 


< 0.1 % 


Dependant field (e.g. Years of daily exposure) 


10 


Dependant field (e.q. Degree of toxicity) ^ 


Not hamiful 


Dependant field (e.g. Arsenic concentration %) 


< 5.0 % 


Dependant field (e.g. Years of dally exposure) 


10 


Dependant field (e.g. Degree of toxicity) 


Sonnewhat hamiful 


Dependant field (e.g. Arsenic concentration %) 


> 10.0% 


Dependant field (o.g. Years of daily exposure) 


10 


Dependant field (e.g. Degree of toxicity) 


Fatal 
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Results display: Example of a structured object. 

Please find below an example of the results table taken from a sample query requesting TSCA Health & 
Safety information for substances whose name begins with Toluen". Four Structured Objects were selected 
for "Display^ In the dynamic filters. These were: 

U.S. EPA . TSCA - Sect. 8 (d) - Health and Safety Reporting List 

U.S. EPA - TSCA - Sect. 8 - Health and Safety Reporting List - Effective date 
U.S. EPA - TSCA - Sect. 8 - Health and Safety Reporting List - Reporting date 
U.S. EPA - TSCA - Sect. 8 - Health and Safety Reporting List - Expiry (sunset) date 

U.S. EPA - SARA - 312 - Chronic health hazard List 

U.S. EPA - SARA 312 - Chemical listed as a chronic health hazard - (Yes/No) 
U.S. EPA - SARA 312 - Category for chronic health hazard 

U.S. EPA - TSCA - Sect. 8 (d) - Health and Safety Studies Reporting List 

U.S. EPA • TSCA - Sect. 8 (d) - Health and Safety Studies Reporting List (Present) 
U.S. EPA - TSCA - Sect- 8 (d) - Health and Safety Studies Reporting List - Sunset date 

U.S. EPA - SARA - 312 - Acute health hazard List 

U.S. EPA - SARA 312 - Chemicals listed as an acute health hazard - (Yes/No) 
U.S. EPA - SARA 312 - Category for acute health hazard 

The relevant fields of the query results table for this request are listed on the next page: 
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SQL> select sfid. versiorutoken. fgSparent_tolcen . parent_de1 d . fgSdcseQuence, deld 

2 from query.ol.datsLjelenents where QUERY^SUHMARY^XO - 4346 

3 order by sfld. versloiutoken, fgSparent_toicen • parent^deid. fgSdcsequence; 



SFZO VERSION_TOKEN FGSPARENTJTOKEN PARENT J>EZD FGSDC^EQUENCE 



OEID 



11490 
11490 
11490 
15422 
1S422 
15422 
160 25 
16025 
16025 
20002 
20002 
20002 
47690 
47690 
47690 
79312 
79312 
79312 
138057 
138057 
138057 
142833 
142833 
142833 
IS 59 10 
155910 
155910 
159373 
159373 
1S9373 
159378 
159378 
159378 
160038 
160038 
160038 



18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
IB 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 
18 



264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 
264 



10431722 
10431722 
10431722 
10431882 
10431882 
10431882 
10432038 
10432038 
10432038 
10432306 
10432306 
10432306 
10433062 
10433062 
10433062 
10433686 
10433686 
10433686 
10434454 
10434454 
10434454 
10434518 
10434518 
10434518 
10434606 
10434606 
10434606 
10434686 
10434686 
10434686 
10434694 
10434694 
10434694 
1043 4 718 
10434718 
10434718 




10431723 

9999993 
10431725 
1C431883 

9999993 
10431885 
10432039 

9999993 
10432041 
10432307 

9999993 
10432309 
10433063 

9999993 
1043306S 
10433687 

9999993 
10433689 
10434455 

9999993 
10434457 
10434519 

9999993 
10434521 
10434607 

9999993 
10434609 
10434687 

9999993 
10434689 
10434695 

9999993 
10434697 
10434719 

9999993 
10434721 



16671018 
5 

16671020 



M 




215091 
215091 
215091 
215139 
215139 
215216 
215216 
215777 
215777 
215777 
215777 
215783 
215783 
215783 
215783 
215998 
215998 
215998 
215998 
221865 
221865 
221865 
221865 
223034 
223034 
223034 
223034 
223175 
223175 
223175 
223175 
272152 
272152 
272152 
272XS2 

83 rows selected. 



55 
5S 
55 
55 
55 
55 
5S 
55 
55 
55 
55 
55 
55 
55 
SS 
SS 
SS 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 
55 



22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 
22741 



16672283 
16672285 
16672285 
16672357 
16672357 
16672359 
16672359 
16671029 
16671029 
16671031 
16671031 
16670977 
16670977 
16670979 
16670979 
16671841 
16671841 
16671843 
16671843 
16671041 
16671041 
16671043 
16671043 
16672235 
16672235 
16672237 
16672237 
16672259 
16672259 
16672261 
16672261 
16671021 
16671021 
16671023 
16671023 



16672264 

5 

16672286 
5 

16672358 
S 

16672360 
5 

16671030 
5 

16671032 
5 

16670978 
5 

16670980 
5 

16671842 
5 

16671844 

5 

16671042 
5 

16671044 
5 

16672236 
5 

16672238 
5 

16672260 
5 

16672262 
5 

16671022 
5 

16671024 
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This data should be presented in the manner shown below, where the text descriptions have been replaced 
with the token values from the above table. 

Of the three table blocks below, the first is a structure representation shown with the variable names, and the 
next two are^ty pjcal exam p les us ing rep rese ntative sets of data have been extracted from the above data (as 
shown in R^ si^ektlghM C ra g^^ 

In the case of an autonomous field (not a child of a structured object), the FG$PARENT_TOKEN and 
FG$DC_SEQUENCE values can be safely ignored (they should be Null in any case), and the PARENT_DEIO 
value should be Null or 0 (zero). " 



StxbBtaztce 


SFZD 


Varsion 


VERSION TOKEN 


Autonomous Field 
PZBLO TOICEN 


DBID 


structured Object 

FO$ PARKNT_TOKra 




FZSZJ>_TOKBN 


DBZD 



The middle block displays a structured object in a query result that returned a single set of data elements. 



Substance 


160111 


Version 


18 


Structured Object 
264 




Field 


Token 


265 


1 DEID 


10434739 




Field 


Token 


266 


j DEID 


9999993 




Field_ 


Token 


267 


1 DEID 


10434741 



The last block displays a structured object in a query result that retumed a multiple set of data elements. 



Substance 


214851 


Version 


55 






Field 


Token 


22742 


DEID 


S 


Structured Object 




Field 


Token 


22743 


DEID 


16672140 


22741 




Field 


Token 


22742 


DEID 


5 






Field_ 


Token 


22743 


DEID 


16672142 



I, and are used only for ordering and grouping. 



The breakpoints on Substance Form (SFID) and Information Version (VERSION_TOKEN) serve to open a 
new grouping table. The PARENT_DEID serves to Identify the sub-groups for the structured object data sets 
3 within the SFIDA/ERSlON_TOKEN group. Lastly, the FG$DC_DEQUENCE serves to order the 

FIELD_TOKEN/DEID elements within the sub-group. 



The information contained in this document is proprietary to i^taCHEST.com Inc, confidentiBi and Intended onJy for tntamaf use or 
for communication with DataCHSST clients, supptiers and business partners. 

066 



BNSDOCID: <WO 0159613A2_I_> 



wo 01/59613 PCT/IBOl/9036^ 



DataCHEST 



com 



'4,. 15:32. Page of 9 



Title: Result Presentation Notes 



Addition of Effective Date and Expiry Date fields to Preset Filters 



Also, as we discussed this morning, with the new fields (Effective/Expiry Dates) being added to, or corrected in. the 
Presets, we will have to change the results page once again. We have two choices here: 

1 .) we can add these fields to the GROUP^BY criteria, as in the following example 



Substance Name 


Water 


Substance CAS number 


000-00-0 


Infomiaton Version 


DataCHEST Sample;v1;DataCHEST Sample Data;v1; 


Informaton Provider 


DataCHEST com Inc. 


Information Effective Date 


15-NOV-1999 


Inf oimation Expiry Date 


3l-Dec-2000 


Independent FiekJ 
(e.g. Boling Point) 


= 212'F 


slOO-C 




Substance Name 


Water 


Substance CAS number 


000>00-0 


Informatfor. Version 


DataCHEST Sampte:v1; DataCHEST Sample Data;v1; 


Information Provider 


DataCHEST.com Inc. 


Information Effective Date 


01-Jan-2000 


Infbmiation ExDirv Date 




Structured Object 

(e.g. Arsenic Levels In drinking water) 


Dependant field (e.g. Arsenic concentration %) 


< 0.1 % 


Dependant field (e.g. Years of dally exposure) 


10 


Dependant field (e.g Degree of toxicity) 


Not hannful 


Dependant field (e.g. Arsenic concentration %) 


< 5.0 % 


Dependant field (e.q. Years of daily exposure) 


10 


Dependant field (e.g. Degree of toxicity) 


Somewhat 
harmful 


Dependant field (e.g. Arsenic concentration %) 


> 10.0% 


Dependant field (e.g. Years of daily exposure) 


10 


Oeoendant field (e.q. Degree of toxicity) 


Fatal 
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2.) or. we can add columns to the result table like this : 





;;Data^^?i§5^f«|?»^^ 


Substance Name 


Water 


ouijaiance \^f\^ nurnuci 


000-00-0 


Information Version 


DataCHEST.con^ Inc.: v1 : Sample Data: v1 : I 


Ol^an-2000 • 1 


31-Dec-2000 




DataCHEST.com Inc. 


Independent Field 


= 212 


15-Nov.ig99 


31-Oec.2000 


(e.g. Boiling Point) 


= 100 »C 








Dependant Held 

(e.g. Arsenic concentration %) 


<0.1% 


O1-Jan-2000 


31^00-2000 




Dependant field 

(e.g. Years of daily exposure) 


10 


O1-Jan-2000 


31-Dec-2000 




Dependant field 

(e.g. Degree of toxicity) 


Not harmful 


01-Jan-2000 


31-Dec-2000 


Staictured Object 
(e.g. Arsenic Levels in 
drinking water} 


Dependant field 

(e.g. Arsenic concentration %) 


<5.0% 


Ol^an.2000 


31-Dec.2000 


Dependant field 

(e.g. Years of daily exposure) 


10 


01-Jan-2000 


31-Dec.2000 


Dependant field 

(e.g. Degree of toxicity) 


Somewhat harmful 


01-Jan-20O0 


31-Dec-2000 




Dependant field 

(e.g. Arsenic concentration %] 


> 10.0 % 


01-Jan-2QOO 


31-Dec-2000 




Dependant field (e.g. Years of 
daily exposure) 


10 


01-Jafv2000 


31-Dec-2000 




Dependant field (e.g. Degree 
of toxicity) 


Fatal 


O1-Jan-2000 


31-Dec-2000 



Or like this: 



■f ^Ti'V.i-^'**Tj?-!?iHA:?r^-.ilTi^-S.i;4TT.-.-/ 




Effective Data 

'ExpiryDaie:^^^;:'^ 


Substance tvlame 


Water 


Substance CAS number 


000-00-0 


Information Version 


DataCHEST.com Inc.; v1; Sample Data; v1; 


O1-Jan-2000 
31-Dec-2000 


Information Provider 


DataCHEST.Gom Inc. 


Independent Field 
(e.g. Boiling Point) 


«212*»F 


15-NOV-1999 
31-Dec-2000 


= 100*'C 


15-Nov>1999 
31-Dec-2000 


Structured Object 
(e.g. Arsenic Levels in 
drinking water) 


Dependant field 

(e.g. Arsenic concentration %) 


< 0.1 % 


Ol^an-2000 
31-Dec-2000 


Dependant field 

(e.g. Years of daily exposure) 


10 


O1-Jarv2000 
31-Dec-2000 


Dependant field 

(e.g. Degree of toxicity) 


Not harmful 


01-Jarv2000 
31.Dec-2000 


Dependant field 

(e.g. Arsenic concentration %) 


< 5.0 % 


O1-Jan-2000 
31-Oec-2000 


Dependant field 

(e.g. Years of daily exposure) 


10 


01-Jan-2000 
31-Dec-2000 


Dependant field 

(e.g. Degree of toxidty) 


Somewhat harmful 


O1-Jarv2000 ' 
31-Dbc-2000 : 


Dependant field 

(e.g. Arsenic concentration %) 


> 10.0 % 


01-Jan-2000 > 
31.Dec-2000 . 


Dependant field (e.g. Years of daily exposure) 


10 


01-Jan-2000 - . 
31-Dec-2D00 v 


Dependant field (e.g. Degree of toncity) 


Fatal 


01-JaTv2000. • 
31-Dec-2D00 . 
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Addition of Synonym Type 

Another issue that affects the same code in the application is the display of the synonym type Information. 
Because a synonym type is related to a specific name. It must be attached to the name when it is being 
displayed. Ideally, this would add another sub-cell (column) to the name data, as demonstrated by the 
following example: 









Substance Name.. ' 


Water - 


Common Name . 




Ldurehtlan Spriinig Water 


Trade or Brand Name 


Substance CAS number 


OOO-OO-O 


-information Version 


DataCHEST.ccm Inc.: v1: Sample Data: v1: 


Information Provider 


DataCHEST.com Inc. 


Independent Field 


= 212 


(e.g. Boiling Point) 


» 100 *C 
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Abstract 



This document describes the design of the DataCHEST data repository schema. It attempts to explain the 
rationale behind the design, as well as the various dimensions of the data contained within the repository. 
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The DataCHEST Data Repository 

The diagram on the following page (Figure 1) displays the database schema for the main relational database within 
the DataCHEST repository. 

One of the initial criteria for the design of the database is the requirement to support all text fields in the database as 
multliinguat fields. This criterion results in some additional complexity within the database. For example, in the 
Substances table, rather than including a simple text field for ^'Ph^icat State", it is necessary to include a token 
pointing to another table containing the text field in the various languages desired. The same logic dictates the 
inclusion of several other tables, like the Data_Types table, the Field.Names table, and the Synonym_Types table to 
name a few. 
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Design Objectives 

The Schema becomes considerably less cluttered if the language relationships are hidden, as evidenced by the 
diagram in Figure 2 on the following page. 

The objective for the database design is to achieve a structure In which It is possible to assimilate a large number of 
source data elements originating within many different source databases, each with its own particular structure, 
language, and environment 

For the most part, the source data elements all relate to one or more substances, sometimes dependent on the 
physical state of the substance (solid, liquid, gas, etc.), and also at times dependent upon the source or manufacturer 
of the substance. Data elements must also be identified by the provider, version and authority of their source. 

To represent such a structure as a common two-dimensional tabular matrix would be cumbersome and unwieldy, 
resulting in thousands of columns of data and hundreds of thousands of rows of substances. The concept of data 
sources and versions would only add to the confusion. 

By converting the tabular column dimension into a virtual Yow* dimension (the Field Dimension, examined below), we 
are able to more efficiently manage and manipulate the large amount of data in the repository. 
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The Data Elements Table 



The strategy employed by the DataCHEST repository is to convert the 'rows', 'columns' and 'data source and version' 
information into dimensions of a central 'Data_Elements* table, as highlighted in Figure 3 on the following page 

Each of the 'Data_Elements' table entries Is identified by a Substance (USID, or unique substance identifier), a 
Field^Token' indicating the nature of the data element and an Authority/Version token (AV_Token) representing the 
source and version of the data element. 

In view of the wide variety of data element types, it would be an extremely inefficient waste of table space to store the 
actual data in the same table as the dimensional infomnation. Instead, a series of data tables are used to store the 
data, one for each data type. The data in these tables are linked to the 'Data_Eiements* by another token, the Data 
Element Identifier (DEID). 

The nature of the source data is such that a series of data elements occasionally contain entries that include 
comments or qualifiers describing or limiting the individual value. To accommodate this characteristic, the 
Data.Elements tatMe.is linked to a Remarks_Tabte via a separate token (Rmk_Token). 

In order to support the notion of structured objects, the Parent_DEID field has been included in the 'Data_Elements* 
table. This will be explained later in the Data Dimension section of this document. 

For billing purposes, cost and pricing information is included with each Data Element entry. 

Date information is also recorded for effective and expiry dates, as well as for the creation date of the data element. 
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The Substance Dimension 



As indicated earlier the Data_Elements table is related to a Unique Substance Form, qualified by a unique 
Substance, a Physical State and a source (Manufacturer). Multi-lingual 5ut>stanc8 names are held in a separate table 
(Subst_Names). Since a substance can have more than one name, or synonym, every name is characterized by a 
Synonym_Type. and again. In order to be able to store multi-lingual synonym names, they are. in turn, stored in their 
own table. 

In order to model the many-to-many relationships between Substance Names and Substances and Substance fomns, 
two extra tables, Subst_to_Name and Form_to_Name are used. These tables consist of a Name Token and unique 
Identifier that is used to link to the Substance and Substance Forms table. 

When trying to understand the data model, it is useful to think of a substance, for instance water, which exists in 
several physical forms, namely ice, water and steam. Water can also originate from different sources, for example, 
distilled water, Perrier water, Evian water, etc. 

Each of these varieties of water can possess attributes that are common across many varieties, such as viscosity or 
vapor pressure. They can also have properties that are unique to a physical state or source, such as density or 
salinity. 

A single variety of water can also have many synonyms. Take snow, for instance, which can be called powder snow, 
com snow, or a variety of other names. The Eskimo language, Inuktituk. uses over 40 different terms to describe 
snow. 

More to the point, various regulatory bodies have catalogued chemical substances for their own internal purposes, 
and often use their own version of a name to describe the sut>stance. Because these regulatory listings have the 
force of law. the names used to describe a substance become unique synonyms, even if they differ from one another 
due to punctuation or even a spelling mistake. 
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The Data Field Dimension 

The Data Field dimension describes the nature of the data elements within the repository. 

Data Fields have two basic attritnjtes, the Field Name and the Data Type. In order to accommodate multilingual 

naming of Field Names and Data Types these attributes must be stored In separate tables. 

In order to permit users to view a Data Field's documentation on-line, a Field_Mapping table has been defined to 

describe its relationship to the original source data as furnished by the data provider. 

Another construct, the Fleld_Grouping table, was created In order to implement the Query Wizard Index Tree 
structure. This table identifies pairs of parents and children and an optional sequence number for specifying an 
ordering for the children. 

To enhance the look and feel of the user interface, a Data Field may optionally be associated with an icon, which will 
be displayed next to it in the Query Wizard Index Tree. Icons are also associated with data types, so a Data Field can 
thus have two icons associated with it. Because more than one Data Field can apply to the same icon, they have 
been stored in their own table. 

Finally, the Lookup_Table construct has been created to enhance the Index Tree and make data storage more 
efficient. It is used for data element types where the possible values fall within a pre-determined set of values. For 
example. lARC, the International Association for Research into Cancer, publishes a list of chemical substances 
together with their assigned carcinogen Idty category, as described in the table below: 




97; 1^0059; 
98; 10059^ 
99 f 10059 
'I00ri0059' 



97i 



MARC (International Agencyfor^Research^on Cancer) Carci nogen Cla sses 
1 j Carcinogenic to humans 



97; 2: 2Ai Suspected Human Carcinogen • Limited human data 

97 i 3 i 2B I Suspected Human Carcinogen - Sufficient animal data 



101; 10059 
'l02Mq059 
103 1 10059 



97 i 



3: Not classifiable 



97! 



4 i Probably not carcinogenic 



6: 



71 5; Unspecified 



The use of the lookup-table construct permits the display of the possible values for the lARC Carcinogen classes, 
together with a more detailed explanation of their meaning. By implementing the lookup table as part of the Data 
Field dimension, the values can be displayed as part of the Index Tree and as part of the selection criteria drop-down 
tables. 
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The AuthoritvA/ersion, or Source, Dimension 

This last dimension of the Data Element table is the AuthorityA/ersion dimension. Each data element is identified by 
the source from which it has been provided (the Provider database), together with the original source authority (the 
Authority_Names database) from which the data was taken. Because data providers issue periodic releases of their 
data, versioning information must also be incorporated into this dimension. As well, it is not uncommon for multiple 
data providers to furnish data from the same source authority, as in some cases, this is the only valid way to obtain 
the pertinent data. 
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The Data Dimension 



As previously mentioned, data In the DataCHEST repository is stored In separate tables from the Data_Elements 
table, one table for each data type. Although technically not a dimensional element, we are considering it as such for 
the purposes of this document. 

It is not intended that the table constructs remain static. Rather, new tables can be added as new data types are 
necessitated by the soun:^ databases that are integrated into the repository. 

We have so far provided for the two basic data types: 

• Boolean, for simple binary values, tike "YesTNo", •TrueTFalse", "PresentTNot Presenr, etc. 

• Integer, for simple numeric values that are unltless, like year, count, etc., as well as for Index entries into the 
Lookup-Table 

As well, we have provided for multilingual text data types, consisting of a data element paired with a Language Token 
linking the data element to the Language table. 

• Text, for multilingual text values up to the maximum allowable length for character strings under the lowest 
common denominator for ODBC databases, which Is currently 255 characters. 

• Unstructured Text, for multilngual text values that exceed the above maximum length for character strings. 
Two additional data type constructs have been specially adapted to the source data that we have examined so far: 

• Numeric, for data elements that are represented by non-integer values and/or that require a unit of measure 
for Interpretation. This construct has a numeric data element and a link to a Unlts_of__Measure table, 

I containing multilingual definitions of the various units of measure that have been defined during the source 

data transformation process. 

• Substance Composition, for data elements defining the ingredients of substances in the Substance 
database that are mixtures made up of other substances In the Substance databases. This construct is 
composed of sets of ingredients, their fractions and the corresponding units of measure. 

As indicated earlier, data elements can be further characterized by individual remarks held in the Remarks table. 
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1 Abstract 



This document discusses the concepts surrounding the creation and display of Sub-Trees in the 
DataCHEST Query Builder application, and offers implementation suggestions. 



The DataCHEST Poly-Hierarchical Index Tree Is a unique and very powerful structure for classifying the various 
Data Fields contained In the DataCHEST Repository. At the same time as it provides an organizational reference 
(table of contents') In which to locate a particular Data Field. It also provides a context for each node that can be used 
to help refine any searches that are performed for or on the node. 

The large number (thousands) of Data Fields that are integrated into the DataCHEST Repository makes browsing 
the full tree a little cumbersome even though the tree is classified Into topics and sub-topics. The concept of creating 
a sub-tree Involves applying a variety of different filters against the entire tree-structure In order to be able to visualize 
a sub-set of the tree that may correspond to a particular set of constraints which better meet the criteria that a 
particular user is lodging for. 

For example, if a user is searching the Index Tree for data pertaining to a particular substance, it would be much 
more efficient to present an Index Tree containing only those data nodes and their respective parent topics which 
contain data for the substance under consideration. In another situation, if a user is looking for North American 
Occupational Exposure Limits, he or she will find them much more quickly if an index Tree can be constructed from 
only those nodes that relate to both of these criteria. 

It will become evident from the examples later In this document that the large variety of additional criteria that can be 
used to filter the Index Tree will increase the utility and usefulness of the Index Tree structure by several orders of 
magnitude. 
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3 Design 



The full Index Tree is currently created off-line during the Data Repository update procedure by processing 
the tables in the Normalized Database Instance with an Oracle PL/SQL script. This script creates a de- 
normalized table that maps to each node in the tree, even if the same field token appears more than once. 
The de-normalized table is then used to display the tree structure within the browser. 

A similar process, written Into a JSP (Java Server Page) routine can be used on-line to create a temporary 
table in exactly the same format In order to implement and display ad-hoc Sub-Trees at the user's discretion. 
The difference between the full Index Tree and the Sub-Tree is that the Sub-Tree is built using a subset of 
the nodes that appear in the full tree. The subset is determined by a variety of criteria that are meaningful to 
the user. 

During the current design cycle, we will be implementing the following selection criteria for Sub-Tree creation: 

1 . Nodes that contain data pertaining to a set of substances selected by Name. CAS and/or other criteria 

2. Nodes that contain data originating from a specific Provider 

3. Nodes that contain data originating from a specific Authority 

4. Nodes that contain data originatirig from a specific Provider Version 

5. Nodes that contain data originating from a specific Authority Version 

6. Nodes whose field name matches a keyword or topic search (e.g. contains the word 'drinking') 

7. Nodes which contain one or more specific data types 

8. Nodes found within a sub-branch of the full tree 

(These will usually also be found below other branches in the tree) 

Note that the application should permit the user to create a Sub-Tree by combining one or more of the above 
criteria in a Boolean expression. Initially, the expression should be limited to a simple AND/OR combination 
of each criteria, but as the application becomes more sophisticated, complex Boolean expressions should be 
possible. 

The details of each of the above Sub-Tree Selection Criteria will be discussed below. 



3.1 Nodes that contain data pertaining to a substance or a set of substances 

This criteria requires that a search be perfomned on the Data_ElGments table for all records whose 
Substance Form Identifier (SFID) are found in the list resolved from the Substance Selection criteria specified 
by the User. The results must t>e grouped by the Field_Token column of the table, and then used to select 
which nodes are eligible for inclusion in the Sub-Tree. 

A sample SQL request to extract the candidate Fleld_Tokens is listed below: 
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SELECT 
PROM 
GROX7P BY 
HAVING 



DATA_ELEMENTS . FIELD_TOKEN, DATA^EIiEMENTS . SFID 
DATA_ELEMENTS 

DATA_ELEMENTS . FIELD_TOKEN, DATA_BLEMENTS . SFID 
( ( (DATA_EIiEMENTS.SFID) In (10346, 11002) ) ) 
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3.2 Nodes that contain data originating from a specific Provider 

This next criterion requires that a search be again perfomied on the Data^Elements table, but this time for all 
records whose Version_Token Is related to a specific Provider or a set of Providers. As before, the results 
must be grouped by the Field_Token column of the table, and then used to select which nodes are eligible 
for inclusion in the Sub-Tree. 

The SQL request in this case Is slightly more complex, as It must Join several tables in order to extract the 
candidate Field^Tokene. A sample is again listed below: 




SELECT DATA_ELEMENTS . FIEL.D_TOKEN , PROVIDER_VERS IONS . PROV_TOKEN 
PROM DATA_ELEMENTS INNER JOIN ( INFO^VERSIONS INNER JOIN 

I PROVIDER^VERSIONS ON INFO_VERS IONS . PV_TO KEN = 

PROVrDER_VERSIONS. PV_TOKEN) ON DATA_ELEMENTS . VERSION_TOKEN = 

INFO__VERSIONS . VERSI0N_T0KEN 
GROUP BY DATA_ELEMENTS.FIEI*D_TOKEN, PROVIDER_VERSIONS . PROV TOKEN 
HAVING ( ( ( PROVIDER_VERS IONS . PROV_TOKEN) = 1 ) ) ; " 



3.3 Nodes that contain data originating from a specific Authority 

This next criterion requires that a search be again perfomied on the Data^Elements table, but this time for all 
records whose VersionJToken is related to a specific Authority or a set of Authorities. As before, the 
results must be grouped by the Field.Token column of the table, and then used to select which nodes are 
) eligible for inclusion In the Sub-Tree. 

The SQL request In this case is a minor variation on the one used previously for the Provider criterion. A 
sample is again listed below: 

SELECT DATA_ELEMENTS . FIELD_TOKEN , AUTH__VERSIONS . AUTH_TOKEN 
FROM (DATA_ELEMENTS INNER JOIN INFO_VERSIONS ON 

5 DATA_BLEMENTS.VERSION_TOKEN = INFO_VERSIONS . VERSION_TOKEN) INNER 

JOIN AUTH_VERSIONS ON INFO_VERSIONS . AV TOKEN = 

AUTH_VERsiONS . AV_TOKfiN " 
GROX7P BY DATA_ELEMENTS.FIELD_TOKEN, AUTH_VERSIONS . AUTH_TOKEN 
HAVING ( ( (AUTH_VERSIONS . AUTH_TOKEN) =1) ) ; 



3.4 Nodes that contain data originating from a specific Provider Version 

This next criterion requires that a search be again performed on the Data.Elements table, but this time for all 
records whose VersionJToken is related to a specific Provider Version or a set of Provider Versions. As 
before, the results must be grouped by the Field_Token column of the table* and then used to select which 
3 nodes are eligible for inclusion in the Sub-Tree. 

The SQL request in this case is a minor variation on the one used previously for the Provider criterion. A 
sample is again listed below: 

SELECT DATA_ELEMENTS . FIELD_TOXEN , INF0_VERSI0NS . PV_TOKEN 

FROM (DATA_ELEMENTS INNER JOIN INFO_VERSIONS ON 

D DATA^ELEMENTS .VERSIONJTOKEN = I NFO^VERS IONS - VERSIONJTOKEN) 

GROUP BY DATA_ELEMENTS.FIELD_TOKEN, INFO_VERSIONS . PV_TOKEN 

HAVING ( ( ( INFO_VERSIONS . PV_TOKEN) =1 ) ) ; 
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3.5 Nodes that contain data originating from a specific Authority Version 

This next criterion requires that a search be again performed on the Data_Elements table, but this time for all 
records whose Version_Token is related to a specific Authority Version or a set of Authority Versions. As 
before, the results nnust be grouped by the Field.Token column of the table, and then used to select which 
nodes are eligible for inclusion in the Sub-Tree. 

The SQL request in this case a minor variation on the one used previously for the Authority criterion. A 
sample is again listed below: 

SELECT DATA_ELEMENTS . FIELD_TOKEN, AUTH_VERSIONS . AV_TOKEN 
FROM (DATA_EIiEMENTS INNER JOIN INFO_VERSIOh3S ON 

DATA_ELSMENTS . VERS ION_TOKEN * INFO_VERS I ONS . VERS ION_TOKEN ) 
GROXTP BY DATA_ELSMBNTS.FIELD_TOKBN, INFO_VERSICNS . AV_TOKEN 
HAVING ( ( ( INFO_VERS I ONS . AUTH^TOKEN ) = 6 9 ) ) ; 




3.6 Nodes whose field name matches a Iceyword or topic search 

This criterion searches against the Field_Names table looking for matches against one or more keywords or 
keyword fragments. In the Visual Basic Prototype version, the results of this search were used to highlight 
matching tree nodes, but in the Browser-based Java version, it is much easier to simply create a sub-tree for 
the matches. 

The sample SQL script Is listed below: 

SELECT [FIELiD_NAMES] . tFIELD_TOKENl , LCase$ ( [FIELD_NAMEl ) AS Exprl, 

[FIELD_NAMES) . (LANG_TOKENl 
FROM FIELD_NAMES 

WHERE ( ( (L.Case$ ( [FIELD NAME] ) ) Like "%drinking%" ) And 
< ( [FIELD_NAMES] . (LANG_TOKEN3 )=!)); 



3.7 Nodes which contain one or more specific data types 

This criterion searches against the Fleld_Descriptions table looking for fields with one or more specific data 
types. This would be most useful for those users who are looking for specific Information formats, for 
example, in order to load a database. 

The sample SQL script is listed below: 

SELECT FIELD_DESCRIPTIONS . FIELD_TOKEN, FIELD^DESCRIPTIONS . DTATYP^TOKEN 
FROM FIELD_DESCRIPTIONS 

WHERE ( ( ( FIELD_DESCRIPTIONS . DTATYP_TOKEN) In (1,3,7,9,10))); 
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3.8 Nodes found within a sub-branch of the full tree 

This te the original criterion for which the concept of the Sub-Tree was created. This feature allows the user 
to refine his search throu^ the tree by specifying progressive refinements. For instance, creating a sub-tree 
on Regulatory Lists, followed by a second sub-tree on Regions - North America - United States -US 
Federal, would yield a sub-tree containing only U.S. Federal Regulatory Lists, which could then be browsed 
by descending another branch, such as Health. -"wobu 

I N ®r?f rf .ti'l^crt^'f'?.^' to create this sub-tree would be to collect Field Tokens from the temporary 
GUI DATAJNDEX table cun^ntly in use all the way down the branch from the currently-selected node 
resoh^e any duplicate tokens, and then buUd a new table from the resuHIng list of field tokens. I will not hazard 
an attempt to create an SQL script for this task, as it is probably better handled by a programmed sequential 
read until the next peer or superior node is found. r h "a 00 aequennai 
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4 Implementation Strategies 

(NOTE: This preliminary strategy is only speculative and must be tested further. Also, it should be noted that 
both ccxje examples Included in the Appendices were written very quickly and can probably benefit 
considerably from some optimization analysis.) 

in the design phase, samples of the SQL scripts that extract the list of candidate Field Tokens for the new 
Sub-Tree have been stiown. The PUSQL procedure script FIELD_DATAJNDEX that creates the 
GULDATAJNDEX table is included in ApF>endix A 

The relevant Sub-Trees could be created by the existing FIELD_DATAJNDEX procedure by simply 
modifying the query inside the PUSQL cursor C^FIELDGRP. declared in the Get_Child sub-procedure. A 
condition must be added to the WHERE clause indicating that CHILDJTOKEN is IN the set defined by the 
appropriate SQL statement for the method desired. 

By way of example, to create a sub-tree for nodes containing the word "drinking", the C^FIELDGRP cursor 
could be defined as follows: 

SBIiECT CHILD_TOKEN, PARENT_TOKEN 

FROM F I ELD_GROUP I NGS 

WHERE PARENT_TOKEN « P_CHILD 

AND CHIIiD_TOKEN != PARENT_TOKEN 

AND CHIIiD TOKEN IN 

(SELECT [FIELD_NAMES] . [FIELD_TOKEN] , LCase$ ( [FIELD_NAME] ) AS 
Exprl, [FIELD_NAMES] . [IiANG__TOKEN] 

FROM FIELD_NAMES 

WHBRB ( ( (LCase$ ( tFIELD^NAME) ) ) Like " %drinlcing% " ) And 
( ( [FIELD_NAMES1 . (LANG_TOKEN] ) =1) ) 

) 

ORDER BY FIE1jD_GROUPINGS.DC_SEQUENCE 

Once the Sub-Tree has been built, it must be 'cleaned' to remove field nodes without data and topic nodes 
without fields. This is accomplished using the logic described in Appendix B, which contains the commented 
Visual Basic code of a procedure written by the DataCHEST integration team for this purpose. 
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Appendix A PL/SQL Procedure FIELD_DATAJNDEX 



- icit les rejets ne sont pas dteomptd dans la chltdcount voir si cela a un impacte 

create or replace function FIELD_NAME_DATAJNDEXE$ (P_LANG_TOKEN number) return number is 

- Cette Fonction execute le chargement de GULDATAJNDEXES 

- Si une erreur est d6tect6e dans ce programme, alors le r^sultat 

- du renvoi de la fonction sera n6gatif. 

— Si aucune erreur: 0 est renvoy6. 

— Fieid_Descriptions 

v_F{ELD_TOKEN fie!d_descriptions.FiELD_TOKEN%type: 
v_L riD field_descriptions.LTiD%type; 
v_lCONS_TOKEN field_descriptions.(CONS.TOKEN%type; 
v_DTATYP_TOKEN fleld_descriptions.DTATYP_TOKEN%type: 

" Field_Names 

v_FIELD_NAME field_names.FiELD_NAME%type; 

— Gui_Data_lndexes 

v_DATA_INDEXJCON_NAiy/IE guLdataJndexes.DATA INDEX_ICON_NAI\/1E%type: 
v«ARG_IS_TOKEN guLdataJndexe3.ARG_IS_TOKEN%lype; 
v_PARENT_DATAJNDEX_ID guLdataJndexesPARENT_DATAJNDEX_ID%type; 
v_FILTER_SPECJD guLdataJndexes.FILTER_SPECJD%type; 
v_CHILDCOUNT guLdataJndexes.CHILDCOUNT%type; 
v_DATA_TYPEJCON_NAME ICONS.ICONPATHANDNAME%type; 

— fieid_groupings 

— Autres 

v_ERROR integer; 

v__LEAF boolean; 
Erreur^lnsertion exception; 
v^UNIQJD number(IO); 

- icit enleve 
NBR^APPEL NUMBER(IO); 
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function getjfjeaf return boolean Is 
begin 

if v.DTATYP^TOKEN is not null 

and v_DTATYP_TOKEN not In (122) then 

return (true); 
else 

return (false); 

end if; 
end; 



function get_ARG_IS_TOKEN return varchar2 is 
begin 

if v^LEAF = TRUE then 
if {v_DTATYP TOKEN = 9 or v_DTATYP_TOKEN = 1 ) then - SI Lookups 

retumfY'); " 
else 

retumCN); 
end if; 
else 

return (-); 
end if; 
end; 
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— cJataindexicon 

function get^lCONPATHANDNAME (P_PARENT number) return varchar2 is 

v_TOKEN fieW^groupings.CHILD TOKEN%type; 
vJCONS icons.lCONS.TOKEN%type: 

cursor c_lcons is 
select ICONPATHANDNAME 
fronn ICONS 

where ICONS^TOKEN = vJCONS 
and ICONS_TOKEN != 0; 



cursor c^path is 
select gui.DATAJNDEXJCON^NAME 
from guLdataJndexes GUI 
where gui.datajndexjd = P_PARENT; 

vJCONPATH ICONS.ICONPATHANDNAMEVotype; 



begin 

vJCONS := vJCONS_TOKEN; 
open cjcons; 

fetch cjcons into v^lCONPATH; 

If cJcons%found then 

close cjcons; 

retum (vJCONPATH); 
else 

close cjcons: 
open cj)ath; 

fetch c _path Into vJCONPATH; 
close c _path; 
end If; 

retum (v_ICONPATH); 
end; 



function get_LTID retum number is 

begin 

If (v_DTATYP_TOKEN = 9 or v_DTATYP_TOKEN = 1 ) then - Si Loolcups 
If v_LTID = Othen 

return (-1): 
else 

return (v_LTID); 
end if; 
end If; 

return {to_number(")); 
end; 
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— datatypeicon 

function get.DATAjrYPE_ICON_NAME return varchar2 is 

cursor cjcons is 
select ICO.iCONPATHANDNAI^E 
from ICONS ICO 

,DATA_TYPES DTA 
wtiere DTA.DTATYP_TOKEN = v.DTATYP.TOKEN 
and ICO.ICONS_TOKEN = DTA. ICON S^TOKEN; 

vJCONPATH ICONS.ICONPATHANDNAIVIE%type: 

begin 
open cjcons; 

fetch cjcons into vJCONPATH; 

if c Jcons%found then 

close cjcons; 

return (v_ICONPATH); 
end if; 

close cjcons; 

return ("); 
end; 




function get_FILTER_SPECJD return number is 

cursor c_ftlter is 
select FILTER^SPEC JD 
from FILTER^SPECS 

where DTATYPJTOKEN = v_DTATYP_TOKEN; 

v_FILTER_SPECJD filter.specs.FILTER^SPEC JD%type; 

. begin 

if v_LEAF = false then 

return (to_number(")); 
else 

if v_DTATYP_TOKEN Is not null then 
open c_fllter; 

fetch cjilter into v.FILTER.SPECJD; 
close c_filter; 

return (v.FILTER.SPECJD); 
else 

return (to_number(")); 
end if; 
end if; 
end; 
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function get^CHILDCOUNT return number Is 

cursor c_chilcl is 
select count(CHILD_TOKEN) 
from FIELD_GROUPINGS 
where PARENT_TOKEN = v_FIELD_TOKEN; 

v^COMPTEUR integen 

begin 
if v_LEAF = true then 

return (to_number(")); 
else ~ 

open c_child; 

fetch c_child Into v_Compteur; 
close c_chlld; 

return (v_Compteur); 

end if: 
end; 
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» Child insertion in GUi.DATA.LOOKUPS table 

Procedure lnser1_Child (P_CHILD_TOKEN number, P_PARENT number) is 
v_NBR_INDEh4T integer; 



Recuperation de I'infformation gen6rale 

cursor C^FIELD is 
select DESCR.FIELD_TOKEN 

,DESCR.LTID 

.DESCR,ICONS_TOKEN 

,DESCR.DTATYP TOKEN 

, NAM ES. Fl ELD^NAME 
from FIELD^NAIVIES NAMES 

,FIELD_DESCRIPTIONS DESCR 
where NAMES.FIELD_TOKEN = DESCR. FIELD_TOKEN 
and NAMES.LANG_TOKEN = P_LANG_TOKEN 
and DESCR.FIELD_TOKEN = P_CHILD_TOKEN: 

- RECORD WITH FIELD TOKEN 0 IGNORED 

- ENREGISTREMENT COMPORTANT UN TOKEN 0 IGNORE. 



- Retrouver rinformation de la table FIELD.DESCRIPTIONS 



open C_FIELD; 

fetch C.FIELD into v_FIELD_TOKEN 
.V LTID 

.v„ICONS_TOKEN 
,v DTATYP^TOKEN 
,v_FIELD_NAME; 



if C_FIELD%found then 

- Insertion dans GUI.DATAJNDEXES 

v_LEAF := getjfjeaf; 

v_ARG JS^TOKEN := get_ARGJS_TOKEN: 

if v_DTATYP_TOKEN = 2 then 

v_DATA_TYPEJCON_NAME := 
else 

v_DATA_TYPEJCON_NAME := get_DATA_TYPE_ICON_NAIWE; 
end if; 

v^LTID := get^LTID; 

v^PARENT DATAJNDEXJD := P.PARENT; 

V DATA INDEX JCON_N AM E := getJCONPATHANDNAME (v_PARENT_DATAJNDEX ID); 
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v.FILTER^SPECJD := get_FILTER_SPECJD: 
v_CHILDCOUNT := get_CHILDCOUNT; 

if nvl(v_LTID,0) != -1 and v_FIELD_TOKEN is not null then 

if v_PARENT_DATA_INDEX_ID = 0 then 

v_PARENT_DATAJNDEX_ID := to_number("); 
end if; 

if v_DTATYP_TOKEN = 2 then 

v^DTATYP.TOKEN := to_number("); 
end if: 



begin 

insert into GULDATA INDEXES 
( 

DATAJNDEX ID 
.DATA INDEX^NAME 
.INDEXING 

.PRESETJNDEX_FLAG 

.DATA.INDEXJCON^NAME 

.DATA.TYPE_ICON_NAME 

.PARENT^DATAJNDEXJD 

.CHILDCOUNT 

,FILTER_SPECJD 

.LTID 

.DTATYP^TOKEN 
.ARG_lS_TOKEN 

) 

values 
( 

v^UNIQJD 
.v_FIELD_NAME 
,v_FIELD_TOKEN 

;n' 

.nvi(v_DATAJNDEX ICON^NAME/UN KNOWN.GIF*) 

,v_DATA_TYPEJCON_NAME 

.v_PARENT_DATAJNDEXJD 

.v^CHILDCOUNT 

,v_FILTER_SPECJD 

,v_LTID 

,v_DTATYP_TOKEN 
.v_ARG_lS_TOKEN 

): 

v_UNIQJD := v^UNIQJD + 1; 

EXCEPTION 
when others then 

NULL; 

* - ICIT VOIR raise Erreurjnsertion; 

end; 
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end if; 
end if; 

close C^FIELD; 
end; 
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- Search all Childs for all parents 

Procedure Get^Child (P_CHILD number. P_PARENTJD number) is 

v^CHILD number(12): 
v_PARENT number(12); 
v_BOUCLE boolean; 
v^SEQTMP number(12); 

cursor c^FIELDGRP (P^CHILD number) is 
select CHILD^TOKEN, PARENT__TOKEN 
from FIELD_NAMES NAMES 

.FIELD^DESCRIPTIONS DESCR 
.FIELD.GROUPINGS GRP 
where PARENT_TOKEN = P_CHILD 
and CHILD_TOKEN != PARENT^TOKEN 
and NAMES.FIELD^TOKEN = DESCR.FIELD^TOKEN 
and NAMES.LANG^TOKEN = P_LANG_TOKEN 
and DESCR.FIELD_TOKEN = GRP.CHILD_TOKEN 
order by GRP.DC_SEQUENCE; 

begin 
-icit 

NBR^APPEL := NBR^APPEL + 1; 
v^BOUCLE := true; 

-ICIT TEST ENLEVE EN HAS 

- IF v^UNIQJD > 3520 OR NBR APPEL > 20 then 

IF NBR^APPEL < 1 then 

- insert into sh_test values {'recursivite decouverte'); 

NULL; 
ELSE 

open c^FiELDGRP (P.CHiLD); 
while v_BOUCLE = true loop 

fetch c.FIELDGRP into v_CHILD. v^PARENT; 

if c_FIELDGRP%found then 

v_SEQTiVIP := v^UNIQJD; 

lnsert_Child (v.CHILD, P.PARENTJD); 

Get^Child (v^CHILD. v^SEQTMP); 
else 

v_BOUCLE := false; 
end if; 

end loop; 

close c^FIELDGRP; 
END IF; 
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Main Code Body 
begin 

delete from GULDATAJNDEXES 
where PRESETJNDEX^FLAG = 'N'; 
commit; 

v_UNIQJD:=1; 

- ICIT ENLEVE 
NBR_APPEL := 0; 

v^ERROR := 0; 

- Child, Parent Id 

Get_Child (0. 0): 

commit; 

- Code d'erreur pass6 en paramdtre 

v.ERROR := PRESET_DATAJNDEXES(1): 

- pas d'erreur 

return (v^ERROR); 

EXCEPTION 
when Erreurjnsertion then 

return (-10); 
when Others then 

return (-20); 

end; 
/ 




a^a numbe»(iO): 
delete from sh_l>2st: 

^aa ;^ F1£LD._NAME_DATA_!NDEXE3 (1 ). . 
insert into sh__tii£t values (aaa); 

s,r::ecA daUiJrKl-^x id. parenl_datajncle:<_;!d from guLdataJr>deKes order by 1: 
SELECT COuiMTr) FROM GULDATA_, INDEXES ; 
select " from sh__i€st: 
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Appendix B Code to remove nodes without any data 



Pour supprimer tout les noeuds n'ayant aucune sortie vers des donn^es, le processus doit s'effectuer en deux 
stapes : 

1 - Supprimer tout les enregistrements de field^groupings dont le parent_token ou (e childjoken correspond a un 
Element de donn^e (done dont le dtatyp_token est contraire de 2) et dont le parent_token ou child token ne 
retrouve sa correspondance dans data_elennents sous le field.token. * 



Delete from field_groupings where parent^token In (select fleld_token from 
neld_descriptions where not(dtatyp_token=2) and field_token not in 
(select field_token from data__elements group by fieId_token)) or 
child_token in (select field_token from fiekj_descrlptk3ns where 
not(dtatypjoken=2) and field Joken not in (select field Joken from 
data.elements group by field.token)) " 

//•*** A v6rlfler si on ne pourrait ex^cuter autrement pour 6viter les full scan tabile KOJ IN" 

2 - Retrouver h partir des root topics les bouts de branches ' 

Verifier que ces bouts sont blen des 6l6ments de donn^es. sinon les supprimer 
Retoumer au parent des bouts et verifier s'ils ont d'autres enfants 

Si oul, retrouver le bout de ces sous-branches et recommencer le processus ci-haut mentionnd 
jusqu'^ ce que tout les bouts soient des 6l6ments de donn6es (qui auront leur correspondance 
dans data_elements puisque la requete ex6cut6e en premier lieu s'en est assure) 
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Dim compteur As Long 
Dim timerjf As Long 

Private Sub Clean_The_Tree() 
compteur = 1 
Dim db As Database 
Dim rc As Recordset 
Dim rc_data_type As Recordset 
Dim strSQL As String 

•••••• All root topic from field^grouping 

strSQL = "SELECT Child_Token from Ftetd_Grouping where Parent_Token = 0 order by DC_sequence" 
Set db = OpenDatabaseChAchestmdb") 
MsgBox StrSQL 

Set rc = db.OpenRecordset(strSQL) 
While Not rc-EOF 

Call getChildCrcCChild^Token"), db) 

rc.MoveNext 
Wend 
rc. Close 
db. Close 

MsgBox "End** 
End Sub 
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Sub getChild(ByVal strFieldToken As String, db As Database) 
Dim rc_grouping As Recordset 
Dim rc_description As Recordset 
Dim strSQL As String 

Dim parent_token As Long 
Dim rc As Recordset 

StrSQL - "SELECT Child_Token from field_grouping wtiere parent token = " & strFieldToken & " order by 
DC_sequence" 

Set rc = db,OpenRecordset(strSQL) 

If there is no child for the current parent 
If rc.RecordCount = 0 Then 

Retreive the data_type of this parent 
Set rc_description s^lb.OpenRecordsetrselect dtatypjoken from field_descriptions where field token =" & 
StrFieldToken) 

If it's a structured object or a field_topic 
If rc_description(-dtatyp_token") = 12 Or rc_descriptfon("dtatypJoken") = 2 Then 
before to delete that node we have to find It's parents 
Set rc = db.OpenRecordset("seIect parent_token from field_grouping where child_token * " & 
StrFieldToken) 

If Not rc.RecordCount = 0 Then 

parent_token = rcCparent^token") 
rc.Close 

now we delete it 

db. Execute ("delete from fleld_grouplng where child_token = " & StrFieldToken) 
**** now go get his parent to see if he still with some"childs 
rc_description . Close 
Call getChild(CStr(parent_token), db) 
Else 

Exit Sub 
End If 
End If 
Else 

While Not rc.EOF 

Call getChild(rc("Child_Token"). db) 
DoEvents 
rc.MoveNext 
Wend 
rc.Close 
End If 
End Sub 

Private Sub Command 1_Click() 

Clean_The_Tree 
End Sub 
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Abstract 

This document summarises the options and strategies for personalising the behaviour of the DataCHEST 
Query Wizard according to the preferences of the individual users. 



Page 3 of 4 




User Profiles 

Every user will develop his own profile to personalise the operation of the DataCHEST Query Wizard. The User 
Profile will define a number of different behaviour characteristics, which are discussed in this document. 



Provider Priority 

In the case where a query returns multiple values for the same data element, in other words, usually when 
more than one provider has supplied the same information, the User Profile will contain instructions on how to 
the user's preferences. A sample Query result, together with some possible alternatives, are listed below: 

The query for the boiling point of ethylene glycol returns 3 values as listed below: 

t80" C Price: $ 1 .00 Provider A Royalty: S 0.50 

f 82* C Price: $ 1 .60 Provider B Royalty: $ 0.80 . 

1 78' C Price: $ 1 .40 Provider C Royalty: $ 0.70 

1. The Query will retum the first value available from the most preferred provider. The User Profile can 
identify which providers are the most desirable, in order of priority, and the Query Wizard can retum the 
appropriate result. If the provider preference is specified as alphabetic order, the query result would be : 

ISC'* C Price: $ 1 .00 Provider A Royalty: $ 0.50 

2. The Query should retum a single value with the lowest price. In this case, the least costly data element 
will be returned. In the case where more than one data element have the same price, and it is the lowest 
the Query Wizard will select the most preferred provider from the priority list described above. In this 
case the query result would again be : 

tSO" C Price: $ 1 .00 Provider A Royalty: $ 0.50 

3. The Query should retum all of the values. The user is not concerned with the usage charges and wants 
the most complete and corroborated information available. The query would cost the user $ 4.00, and the 
result and provider royalties would thus be : 

780* C Provider A: Royalty: $ 0.50 

- 782*0 Providers Royalty: $ 0.80 

- f78'C Provider C Royalty: $ 0.70 

4. The Query should retum the calculated average of all of the values. In this case, the user should be 
charged somewhat less than the cost of retrieving all of the values, and the royalty charges should be 
divided among the different providers whose data elements were used, and this on a pro-rata basis 
according to their relative value of their royalty prices. A sample calculation follows: 

The value returned should be '780* C based on an average of 3 data elements', the user should be 
charged $ 2.00 (for example), and the royalties of $ 1 .00 (again for example) should be paid as 
foDows: 

Provider A Royalty: $ 0.25 (.5/2 •$1.00) 

Providers Royalty: $ 0.40 (.8/2 •$ 1.00) 
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Provider C Royally: $ 0.35 (.7/2 *$1 .00) 

Many other forms of behaviour are possible, and wIM be implemented according to user demand. 

Format of Result Set 

The Query Result set can be returned in any of the following formats, according to the users' preference: 
HTML fonmat suitable for viewing in a browser 

- One of a variety of Flat File formats, such as SDF (Standaixi Delimited Format), 
CSV (Comma-Separated Values), etc. 

- ODBC format, suitable for direct storage Into a users' local database, such as Microsoft Access or 
SQL Server, Oracle, Dbase, FoxPro, etc. 

Other formats are not envisioned at this time, but will be evaluated If the need arises. 

Language Preferences 

The User Profile should contain the user's language preferences for 

Retrieved Data. The user can determine whether the Query Wizard will search for documents 
written in all languages or only in specified languages. 

User Interface. The user can select the language In which the User interface is launched. Once 
launched, the user can switch between desired languages, but this parameter determines the 
language initially used. . 

Character Set Preferences 

The User Profile should contain the user's character set preferences for 

Retrieved Data. The user can specify the character set to be used for the Result Set. 

Launch Point 

The User Profile should contain pointer infomriation to automatically execute specific navigation and query 
commands upon launching to bring the user to a desired status within the Query Wizard. This should 
probably be implemented as a specially named 'start-up' macro. 

Saved Requests 

The User Profile should be designed to be able to contain a potentially unlimited set of saved requests. Using 
this feature, the users will be able to save requests and execute them on a repetitive basis. The 
implementation of a scheduler is contemplated for later implementation. 
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RETRIEVE request taken from the WebLogic Trace Log and edited for clarity 

TRACE: call translateQuery {765 , 1, 'AMD', ?, ?. ?, ?, ?, ?) 
LranslateQuery out parameters: I 

5 resultRetrieve: 
SELECT 

Ol-__DATA_ELEMENTS . CAS "Substance CAS" , 
SUBST_NAMES.NAME_DESCRIPTION •Substance Narae" . 
decode ( 

10 ol_data_element8 . £ield_token« 

'10091' , 

GetDataText ( ' DATA_TEXTS ' . 1 , ' Varchar2 ' . OL_DATA_ELEMENTS . DEID) 
,null 

15 ) "Molecular formula" 

result From: FROM OL_DATA_ELEMENTS , SUBST_NAMES , SUBST_TO_NAMES 
resultWhere : 

20 WHERE ( OL_DATA_ELEMENTS .USID = SUBST_TO_NAMES . USID 

AND SUBST_TO_NAMES . LAMG^TOKEN = SUBST_NAMES . LANG_TOKEN 
AND SUBST_TO_NAMES . NAME_TOKEN = SUBST_NAMES . NAMS_TOKEN 
AND SOTST_NAMES . LANG_TOKEN« 1 
) 

25 AND ( OL_DATA_ELEMENTS . CAS = * 5 0 0 0 0 • ) 

AND ( (OL_DATA_ELEMENTS.FIELD_TOKSN =10091) ) 

resultOrderBy : ORDER BY "Substance Name" ,orderbyNum( "Substamce CAS") 

30, inaertFrom: FROM OL_DATA_ELEMENTS , SUBST_NAMES . SUBST_TO_NAMES 
insertwhere : 

WHERE ( OL_DATA_ELEMENTS.USID - SUBST_TO NAMES. USID 

AND SUBST__TO_NAh!ES.LANG_TOKEN = SUBST__NAMES , LANG_TOKEN 

AND SUBST_TO_NAMES.NAME_TOKBN « SUBST__NAMES . NAME_TOKEN 

35 AND SUBST_NAMES . LANG_TOKEN= 1 

) 

AND ( ( (OL DATA ELEMENTS . FIELD TOKEN sl0091) ) ) 



40 



TRACE: buildQueryO : ida7£5 or9UserId«21 



TRACE: call buildQuery Summary 
( 

765. 
21, 

45 'SELECT OL_DATA_ELEMENTS . CAS "Substance CAS" , 

SUBST_NAMES.NAME_DESCRIPTION "Substance Name" . ' 
decode ( 

ol_data_elements . f ield_token, 
•10091' , 

50 

GetDataText ( ' DATA_TEXTS ' , 1 , ' Varchar2 ' . OL_DATA_ELEMENTS . DEID) 
,null 

) "Molecular formula" ' , 
•FROM OL_DATA_ELEMENTS, SUBST_NAMES , SUBST__TO_NAMES • , 
55 'WHERE 1 OL_DATA_ELEMENTS.USID = SUBST_TO_NAMES . USID 

AND SUBST_TO_NAMES.LANG_TOKEN - SUBST_NAMES . LANG_TOKEN 
AND StmST_TO_NAMES.NAMB_TOKEN = SUBST__NAMES . NAME_TOKEN 
AND SUBST_NAMES.LANG_TOKEN» 1 
) 

60 AND ( OL_DATA_ELEMENTS . CAS - '50000') 

AND ( (OL_DATA_ELEMENTS . PIELD_TOKEN »10091) ) • , 
'ORDER BY "Substance Name" , orderbyNum( "Substance CAS")*, 
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FROM OL__DATA_ELEMENTS , SUBST_NAMES , SUBST_TO_MAMES ' 



•WHERE 



AND 
AND 
AND 
) 

AND 



( OL_DATA_ELEMENTS.USID~= STJBST__TO NAMES. USID 

SUBST_TO__NAMES . LANG_TOKEN « SUBST_NAMEsTlANG TOKEN 
SUBST_TO_NAMBS .NAME^TOKEN • SUBST NAMES 1 NAME TOKEN 
SUBST_NAMES.LANG_TOKEN* 1 " ~ 

( ( ( OL_DATA_ELEMENTS . FI ELD_TOKEN o 1 0 0 9 1 > ) ) • 



10 



15 



20 



25 



30 



Request as built by Application, extracted from WabLogic Trace log 



SELECT OL_DATA_ELEMENTS . CAS "Substance CAS", SUBST_NAMES . NAME DESCRIPTION 
"Substance Name", — 
decode (ol_data_elements . f ield_token j • 10091 ' , 

GetDataTextCDATA_TEXTSM, 'VarcharS' ,OL_DATA_ELEMENTS.DEID), null ) "Molecular 
formula" 

FROM OL_DATA_ELEMENTS, SXJBST_NAMES , SUBST_TO_NAMES 

WHERE { OL_DATA_ELEMENTS.USID « SUBST__TO_NAMES . USID 

AND SUBST_TO_NAMES.LANG_TOKEN « SUBST_NAMES . LANG_TOKEN 

AND SOTST_TO__NAMES.NAME_TOKEN « SUBST__NAMES . NAME TOKEN 

AND SUBST NAMES , LANQ_TOKEN» 1 ) 

AND ( OL_DATA_EL.EMENTS . CAS = * 50000') 

AND ( (OL_DATA_ELEMEKTS . FIELD_TOKEN =10091)) 

ORDER BY -Substance Name" , orderbyNum{ -Substance CAS") 



, Takes forever. 



I began this request In an Oracle SQLplus window, opened a second window, ran the segmented 
requests, formatted the results Into this Word document and added my comments. I am now 
ready to print this document, and this request is STILL running. 
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Segmented request, part 1: Get USIDs and DEIDs 



SEliECT US ID, DEID from OL_DATA_BLEMB»TS 
WHERE OL_DATA_ELEMENTS . CAS = '50000' and 
Oil DATA IlEMENTS . FIELD TOKEN = 10091 



I ... comes back quickly with 16 entries 



10 



15 



20 



25 



us ID 


DEID 


11339 


11023156 


11338 


11023258 


21795 


12167176 


21795 


12495736 


21795 


12481455 


21795 


12377614 


21795 


12377615 


21795 


12377616 


21795 


123 77617 


21795 


12377618 


21795 


12377619 


21795 


12377620 


21795 


12296642 


21795 


12506337 


21795 


12232701 


21795 


12449989 



Segmented request, part 2: Get requested data 



30 



35 



Select DC__OATA from DATA__TEXTS 

WHERE DEID In (11023156, 11023258, 12167176, 12495736, 12481455, 12377614, 
12377615, 12377616, 12377617, 1237761B, 12377619, 12377620, 12296642, 
12506337, 12232701, 12449989) 



comes back quickiy with 16 entries 



DC DATA 



40 



45 



50 



55 



CH20 
CH20 
CH20 
CH20 
CH20 
CK20 
CH20 
CH2D 
CH20 
CH20 
CH20 
CH20 

C31H31CIFN1103S 

CH20 

C12H24N20 
CH20 
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Segmented request, part 3: Get Substance Names 



Select NAME_DESCRIPTION from SUBSTNAMES, SUBST TO_NAMES 
Where USID in (11339, 11338, 21795) ^ 
and SUBST_TO_NAMES . NAME_TOKEN = SUBST_NAMES . NAME_TOKEN 
AND SUBST_TONAMES . IiANQ_TOKEN = SUBST NAMES. lANG TOKEN 



|\7. comes back quickly with 70 entries 



10 NAME DESCRIPTION 



OXOMETHANE 

PARAPORM 

OXYMETHYLEME 
15 POLYOXYMETHYLENE GLYCOLS 

SOLUTIONS KNOWN AS FORMALIN 

SUPERLYSOFORM 

TETRAOXYMETHYLENE 

TRIOXANE 
20 FA 

PANNOFORM 

IVALON 

FORMALDEHYDE ... % 
FORMALITH 

25 PORMAI^DEHYDE, SOLUTIONS, WITH NOT LESS THAN 25 PER 

FORMALINE (GERMAN) 
FORMALINA (ITALIAN) 
FORMALDEHYDE 

FORMALDEKYD (CZECH, POLISH) 
30 FORMALDEHYDE, AS FORMALIN SOLUTION (DOT) 

FYDE 
HOCH 
KARSAN 

FORMALDEHYDE (GAS) 
35 FORMALDEHYDE, SOLUTIONS, FLAMMABLE 

FORMALIN- LOESUNGEN (GERMAN) 

FORMALIN (AS FORMALDEHYDE) 

FORMIC ALDEHYDE 

FORMALIN 
40 FORMALIN 4 0 

FORMOL 

LYSOFORM 

MORBICID 

METHANAL 
45 METHYLENE OXIDE 

METHYLENE GLYCOL 

METHYL ALDEHYDE 

OPLOSSINGEN (DUTCH) 

NCI-CQ2799 
50 ALDEIDE FORMICA (ITALIAN) 

ALDEHYDE PORMIQUE (FRENCH) 

BFV 

FYDE 

FORMALDEHYDE ... % 
55 FORMALDEHYDE, SOLUTIONS, WITH NOT LESS THAN 25 PER 

FORMALDEHYDE ( GAS ) 
FORMALIN 

FORMALDEHYDE SOLUTION (FORMALIN) 
FORMALDEHYDE 
60 FORMALDEHYDE, SOLUTIONS, FLAMMABLE 

FORMALIN (AS FORMALDEHYDE) 
FORMOL 
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FORMALITH 
MORBICID 
FORnAIiDEHYDE 
Forma 1 dehyde 
5 formaldehyde 

Forma I dehyde (gas) 
FORMALDBHyDE (GAS) 
formaldehyde ... % 
Formaldehyde , solut ions 
10 Formaldehyde, solutions* flammable 

Forma 1 dehyde ; Forma 1 in 
Formalin 
FORMAJUIN 

FORMAL.IN (AS FORMALDEHYDE) 
15 Formic aldehyde 

Me thai dehyde 
Methanal 
Oxome thane 
Oxymethylene 
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CLAIMS 

Claimed is: 

1. A process for integrating data from a variety of databases each with its own content. 
5 organization and structure into a single repository, whereby the user can then retrieve 

and display the integrated data in its original form, comprising: 

a. transferring data from a source medium into a staging area; 

b. converting said data into a database format; 

c. mapping said data into a master Index Tree; 

^0 ^- normalizing or de-normalizing at least a portion of said data into a format suitable 

for parsing and integration into a target database; 

e. parsing said data to extract granular data; 

f. exporting said data of step e. into said target database; 

g. formatting said data in a normalized form; 



15 h. 



verifying said target database to ensure integrity has been maintained and 
reproduction has been accurate; 



i. archiving said target database to a storage media; 

j. merging said data into a master database repository; 

k. exporting said data from step j. to a pre-production instance; and. 

20 I. de-normalizing said data from step k. so as to optimize access time in an online 

environment. 
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